


Emotional Speech Recognition Using Acoustic Models of Decomposed Component Words

Vivatchai Kaveeta, Karn Patanukhom
Visual Intelligence and Pattern Understanding Laboratory
Department of Computer Engineering, Chiang Mai University, Chiang Mai, Thailand
[email protected], [email protected]

Abstract—This paper presents a novel approach for emotional speech recognition. Instead of using the full length of a speech signal for classification, the proposed method decomposes speech signals into component words, groups the words into segments, and generates an acoustic model for each segment by using features such as audio power, MFCC, log attack time, spectrum spread, and segment duration. Based on the proposed segment-based classification, unknown speech signals can be recognized into sequences of segment emotions. Emotion profiles (EPs) are extracted from the emotion sequences. Finally, the speech emotion can be determined by using the EP as a feature. Experiments are conducted by using 6,810 training samples and 722 test samples which cover eight emotion classes from the IEMOCAP database. In comparison with a conventional method, the proposed method improves the recognition rate from 46.81% to 58.59% in eight-emotion classification and from 60.18% to 71.25% in four-emotion classification.

Keywords-emotional classification; speech emotion; SVM

I. INTRODUCTION

In recent years, Emotional Speech Recognition (ESR) has been widely studied. ESR plays an important role in human-computer interaction systems, providing a natural interface that can potentially replace traditional input devices. By analyzing the acoustic properties of spoken content, it can determine the emotional state of a speaker. Important applications include decision support in commercial or educational environments, mobile communications, user opinion mining, and emotion recognition in movie scenes.

Most ESR procedures begin with feature extraction. Audio features such as the fundamental frequency (f0), mel-frequency cepstral coefficients (MFCC), and log attack time are frequently used for classification [1]. Recent studies [2], [3] proposed hybrid methods that combine other sources of features such as transcript text, facial movement, and body orientation. K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Hidden Markov Model (HMM), and Gaussian Mixture Model (GMM) classifiers are frequently used to classify speech emotion [4].

Instead of using features from the entire speech signal for emotion classification as conventional methods do, we propose to segment the speech waveform into multiple segments and extract acoustic features for every segment. The extracted features are classified by segment classification models, producing an emotion sequence for the entire speech. We also develop methods for combining the segment emotion states into a whole-speech emotion using unigram and bigram models with a voting scheme and an SVM classifier.

This paper is organized as follows: Section II presents some background on emotion. Section III describes the proposed system in detail. Experiments are conducted and the performance is analyzed in Section IV. Section V concludes this work and suggests future studies.

II. EMOTIONS

Emotion is a subjective term characterizing a physical reaction and state of mind. It is associated with personality, motivation, and mood. Emotional responses are expressed through many signals such as heart rate and facial and vocal expressions. Emotions can be categorized by using crisp sets of emotion classes such as happy, sad, angry, and excited, or by using simple binary classes of negative and non-negative emotions. Some well-known sets of emotions include Ekman's list of six basic emotions [5] (anger, disgust, fear, happiness, sadness, and surprise), Plutchik's wheel of emotions [6], and Parrott's emotion tree [7]. Plutchik's wheel of emotions consists of eight primary emotions (anger, fear, sadness, disgust, surprise, anticipation, trust, and joy) and eight secondary emotions. Each secondary emotion is composed of two primary ones, such as love (joy + trust) and disapproval (surprise + sadness). On the other hand, Parrott's classification models over 100 emotions in a tree structure.

Other approaches in [8] were proposed to describe emotions as combinations of multiple primary emotion classes. Instead of hard-labeling emotions into classes, emotions can also be explained using a fuzzy set system, in which emotions are described as combinations of multiple continuously valued parameters. In [9], [10], valence, activation, and dominance are used as three parameters to describe emotions. A crisp set of emotion classes can be described by these three parameters; for example, "anger" is parameterized by low valence, high activation, and high dominance. Another way to describe emotion is by detailing the presence or absence of each basic emotion label.

Figure 1. Example of an emotion profile of four basic emotions (axes: primary emotion vs. occurrences).



This set of multiple labels is called an Emotion Profile (EP) [11]. EPs measure the degree of each emotion in the data. Based on [11], EPs can be determined from the confidence levels of SVM classifiers. An example EP is shown in Fig. 1. The distributions of EPs can be used to map the connection from basic emotions to more complex ones. In this work, EPs are used as descriptors for classification, while the final results of speech emotion recognition are described by hard-labeled emotions, which is convenient for evaluation and comparison.
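For illustration only, an EP can be held as a normalized mapping from basic emotion labels to degrees; the values below are hypothetical and are not taken from the paper.

```python
# Hypothetical emotion profile (EP) over four basic emotions.
# The values indicate the relative degree of each emotion in one
# utterance and sum to 1; they are purely illustrative.
emotion_profile = {
    "neutral": 0.10,
    "anger": 0.55,
    "sadness": 0.25,
    "happiness": 0.10,
}
dominant = max(emotion_profile, key=emotion_profile.get)  # -> "anger"
```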

III. EMOTIONAL SPEECH RECOGNITION SYSTEM

In this paper, we introduce two new techniques for the ESR system. The first technique is a segment-based classification combined with unigram and bigram language models. For the other technique, we propose a new approach for EP extraction and use it as a descriptor for emotion classification of speech.

A. System Overview

The objective of ESR is to interpret emotion classes from the audio signals of human speech. The proposed ESR scheme consists of six components, as shown in Fig. 2. Word segmentation is the first component; it segments the input audio signal into word segments according to the characteristics of its waveform envelope and a corresponding transcription text script. The starting and ending time indices of each word are obtained after this process. In the second step, every word segment is decomposed into multiple frames of 30-millisecond length with no overlap. The details of word segmentation are presented in Section III-B. The third step is to extract acoustic descriptors from the frames. Each frame produces a 58-dimensional feature vector. Then, in the fourth step, frames are grouped into words (unigram-based segments), pairs of adjacent words (bigram-based segments), and a whole-speech segment using the starting and ending time indices obtained in the word segmentation process. Statistical moments of the frame-wise features (FWFs) are extracted for every segment and used as segment-wise features (SWFs). By including some acoustic temporal features, we obtain a 187-dimensional SWF vector for each segment. The details of FWF and SWF extraction are given in Section III-C. After the feature vectors of every segment have been computed, the next step is to recognize the emotion class of each segment. The segment emotion classification process is described in Section III-D. Finally, to obtain the emotion class of the given speech, SVM and voting schemes are used to analyze the emotion sequence obtained from the segment-wise emotions and classify the sequence into a final emotion state. The emotion sequence classification is presented in Section III-E.
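As a minimal sketch of the second step (the paper only states 30 ms frames with no overlap; the sampling rate and the handling of the trailing remainder below are assumptions), a word segment can be cut into frames as follows.

```python
import numpy as np

# Cut one word segment into non-overlapping 30 ms frames.
# The frame length in samples depends on the sampling rate, which the
# paper does not specify; the trailing remainder is simply dropped here.
def split_frames(word_segment, sr, frame_ms=30):
    n = int(sr * frame_ms / 1000)          # samples per 30 ms frame
    n_frames = len(word_segment) // n      # number of full frames
    return word_segment[:n_frames * n].reshape(n_frames, n)

# Example: a 0.5 s word at 16 kHz yields 16 full frames of 480 samples each.
frames = split_frames(np.zeros(8000), sr=16000)
assert frames.shape == (16, 480)
```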

B. Word Segmentation

To segment audio files into separate word segments, the input speech is filtered to generate a signal envelope, and the local minimum points of the envelope are extracted. Pairs of adjacent minimum points are used as the starting and ending time indices of the word component segments. Fig. 3 shows an example of word segmentation of the speech "excuse me". The separation of the waveform between words can be clearly seen at the local minimum points of the signal envelope.

If a transcription script of the speech is available, the number of segmented words may not be equal to the number of words in the transcription. When the number of segmented words is greater than the number of transcription words, the system merges the pair of adjacent segments with the shortest length. When the number of segmented words is less than the number of transcription words, the system splits the longest segment into two words. The process is repeated until the number of segments matches the transcription. In real-world situations, transcription texts can be obtained from many sources such as movie subtitles or a speech recognition system.
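A minimal sketch of this procedure is given below, assuming SciPy for the envelope filtering. The low-pass cutoff (10 Hz), the 50 ms minimum spacing between envelope minima, and the exact merge criterion are illustrative assumptions; the paper does not report these values.

```python
import numpy as np
from scipy.signal import butter, filtfilt, argrelextrema

# Envelope-based word segmentation with transcript-count alignment (sketch).
def segment_words(signal, sr, transcript_words):
    # Signal envelope: rectify, then low-pass filter (assumed 10 Hz cutoff).
    b, a = butter(2, 10.0 / (sr / 2), btype="low")
    envelope = filtfilt(b, a, np.abs(signal))

    # Local minima of the envelope mark candidate word boundaries
    # (assumed minimum spacing of 50 ms between minima).
    minima = argrelextrema(envelope, np.less, order=int(0.05 * sr))[0]
    bounds = [0, *minima.tolist(), len(signal)]
    segments = list(zip(bounds[:-1], bounds[1:]))

    # Reconcile the segment count with the transcript word count.
    n_words = len(transcript_words)
    while len(segments) > n_words:
        # Merge the adjacent pair with the shortest combined length.
        i = min(range(len(segments) - 1),
                key=lambda k: (segments[k][1] - segments[k][0])
                            + (segments[k + 1][1] - segments[k + 1][0]))
        segments[i:i + 2] = [(segments[i][0], segments[i + 1][1])]
    while len(segments) < n_words:
        # Split the longest segment into two halves.
        i = max(range(len(segments)), key=lambda k: segments[k][1] - segments[k][0])
        s, e = segments[i]
        segments[i:i + 1] = [(s, (s + e) // 2), ((s + e) // 2, e)]
    return segments  # list of (start_sample, end_sample) per word
```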

C. Feature Extraction

The feature extraction process can be considered as two levels of features: FWFs and SWFs. Its purpose is to extract acoustic features from the segmented frames and to generate final descriptors for the speech segments in terms of unigram, bigram, and whole-speech segments.

Figure 2. Overview of the proposed system: the speech signal passes through word segmentation, frame segmentation (30 ms frames), acoustic feature extraction (58-dimensional frame-wise features), segment-wise feature extraction (187-dimensional features for unigram, bigram, and whole-speech segments), segment emotion classification, and emotion sequence classification.



1) Frame-wise Features (FWFs): After the speech signals have been segmented into words, every word is decomposed into frames using a 30-ms window without overlap in order to extract the acoustic features. The acoustic features from the MPEG-7 standard [12] given in Table I are extracted for every frame. $f_k(j)$ denotes the $k$-th component of the FWFs of the $j$-th frame, where $k = 1, 2, \ldots, 43$.

2) Segment-wise Features (SWFs): In this work, the unigram segments, the bigram segments, and the whole-speech segments are proposed as three types of speech segments. The unigram segments are defined as segments of every single word, while the bigram segments are defined as segments of every pair of adjacent words. The SWFs consist of two components. The first component $F_k$ is computed from four statistical moments of every FWF $f_k$ as

$$F_k(j) = \left[\, \mu_k(j) \;\; \sigma_k^2(j) \;\; \Delta\mu_k(j) \;\; \Delta\sigma_k^2(j) \,\right], \qquad (1)$$

$$\mu_k(j) = \frac{1}{T_2(j) - T_1(j) + 1} \sum_{i = T_1(j)}^{T_2(j)} f_k(i), \qquad (2)$$

$$\sigma_k^2(j) = \frac{1}{T_2(j) - T_1(j) + 1} \sum_{i = T_1(j)}^{T_2(j)} \bigl(f_k(i) - \mu_k(j)\bigr)^2, \qquad (3)$$

$$\Delta\mu_k(j) = \frac{1}{T_2(j) - T_1(j)} \sum_{i = T_1(j)}^{T_2(j) - 1} \bigl(f_k(i+1) - f_k(i)\bigr), \qquad (4)$$

$$\Delta\sigma_k^2(j) = \frac{1}{T_2(j) - T_1(j)} \sum_{i = T_1(j)}^{T_2(j) - 1} \bigl(f_k(i+1) - f_k(i) - \Delta\mu_k(j)\bigr)^2, \qquad (5)$$

where $F_k(j)$ is a four-dimensional vector of the $j$-th speech segment computed from the $k$-th component of the FWFs; $\mu_k$, $\sigma_k^2$, $\Delta\mu_k$, and $\Delta\sigma_k^2$ are the mean, variance, mean of difference, and variance of difference of the FWF $f_k$; and $T_1(j)$ and $T_2(j)$ denote the first and the last frame of the $j$-th segment.

The second component of the SWFs, denoted as $G_k$, is a group of temporal acoustic features listed in Table II. After combining the two components, the segment-wise descriptor can be written as

$$S(j) = \left[\, F_1(j) \;\cdots\; F_{43}(j) \;\; G_1(j) \;\cdots\; G_{15}(j) \,\right], \qquad (6)$$

where $S(j)$ is the 187-dimensional feature vector of the $j$-th speech segment.
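A minimal sketch of the statistical-moment construction in (1)-(6), assuming NumPy: `fwf` is a (T, 43) array holding the FWFs of frames $T_1(j)$ to $T_2(j)$ of one segment, and `temporal` is the 15-dimensional vector of Table II features, whose computation is not shown here.

```python
import numpy as np

# Segment-wise feature S(j) from frame-wise features, following (1)-(6).
def segment_features(fwf, temporal):
    mu = fwf.mean(axis=0)                        # mean, eq. (2)
    var = fwf.var(axis=0)                        # variance, eq. (3)
    diff = np.diff(fwf, axis=0)                  # f_k(i+1) - f_k(i)
    dmu = diff.mean(axis=0)                      # mean of difference, eq. (4)
    dvar = diff.var(axis=0)                      # variance of difference, eq. (5)
    F = np.stack([mu, var, dmu, dvar], axis=1).ravel()   # F_1(j)..F_43(j), eq. (1)
    return np.concatenate([F, temporal])         # 4*43 + 15 = 187 dims, eq. (6)

# Example: 20 frames of 43-dimensional FWFs plus 15 temporal features.
swf = segment_features(np.random.rand(20, 43), np.zeros(15))
assert swf.shape == (187,)
```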

D. Segment Emotion Classification

The speech segments with their corresponding descriptors $S(j)$ obtained in Section III-C are classified by using acoustic models that are trained by multi-class SVMs with a one-against-all strategy. In this work, two types of acoustic models, called general segment models and particular segment models, are trained by using the training data.

1) General Segment Models (GSMs): GSMs are defined as acoustic models for each type of segment (unigram, bigram, or whole-speech segment), generated by using all available SWFs of that segment type. For example, every segment-wise descriptor obtained from all unigram segments is used to create a general unigram segment model. In this work, three GSMs, one for each of the three segment types, are generated in the training process.

2) Particular Segment Models (PSMs): PSMs are defined as acoustic models for each particular segment of a word or a pair of words. PSMs are created to classify specific segments. For unigram segments, each particular unigram segment model is generated by training on all SWFs of the word that appears in the training data. For instance, based on the known transcription text, the PSM of the word "love" is trained from every segment-wise descriptor extracted from every speech segment of the word "love". In this work, PSMs for both unigram and bigram segments are generated.

Fig. 4 demonstrates the segment emotion classification scheme. In the recognition process, unigram and bigram segments that are present in the PSMs are classified by their corresponding models.

Figure 3. Word segmentation of the "excuse me" audio: (a) speech signal, (b) signal envelope, (c) segmented signal.

TABLE I. ACOUSTIC FRAME-WISE FEATURES

  Group       Feature name                                Size
  Energy      Audio Power (AP)                            1
  Harmonic    Audio Fundamental Frequency (AFF)           1
  Perceptual  Total Loudness (NTOT)                       1
              Sonogram (SONE)                             8
  Spectral    Audio Spectrum Centroid (ASC)               1
              Audio Spectrum Roll-off (ASR)               1
              Audio Spectrum Spread (ASS)                 1
              Mel Frequency Cepstrum Coefficients (MFCC)  24
              Audio Spectrum Flatness (ASF)               4
  Temporal    Zero Crossing Rate (ZCR)                    1

TABLE II. ACOUSTIC SEGMENT-WISE FEATURES

  Group       Feature name             Size
  Temporal    Auto-correlation (AC)    12
              Temporal Centroid (TC)   1
              Log Attack Time (LAT)    1
              Duration (DU)            1
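The paper computes the Table I descriptors following the MPEG-7 definitions [12] and does not state a toolchain. As a rough, hedged sketch, broadly comparable (but not identical) frame-wise descriptors could be computed with librosa; the perceptual features (NTOT, SONE), AFF, and the Table II segment-wise temporals are omitted here, and spectral bandwidth and single-band flatness only approximate the MPEG-7 spread and 4-band flatness.

```python
import numpy as np
import librosa

# Approximate per-frame descriptors (30 ms frames, no overlap) for one word
# or utterance; returns an array of shape (n_frames, 30).
def frame_wise_features(signal, sr, frame_ms=30):
    n = int(sr * frame_ms / 1000)
    stft_kw = dict(n_fft=n, hop_length=n, center=False)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=24, **stft_kw)        # MFCC (24)
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr, **stft_kw)  # ~ASC
    rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr, **stft_kw)    # ~ASR
    spread = librosa.feature.spectral_bandwidth(y=signal, sr=sr, **stft_kw)   # ~ASS
    flatness = librosa.feature.spectral_flatness(y=signal, **stft_kw)         # ~ASF (1 band)
    zcr = librosa.feature.zero_crossing_rate(signal, frame_length=n,
                                             hop_length=n, center=False)     # ZCR
    frames = librosa.util.frame(signal, frame_length=n, hop_length=n)
    power = (frames ** 2).mean(axis=0, keepdims=True)                         # audio power
    return np.vstack([power, mfcc, centroid, rolloff, spread, flatness, zcr]).T
```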



In the case that no PSM exists for the target segment, the GSM is used instead. Because whole-speech segments do not have PSMs, whole-speech features are always classified by using the GSM.
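A minimal sketch of this routing rule, assuming the PSMs and GSMs are stored in dictionaries of trained scikit-learn-style classifiers (the container names and lookup keys below are hypothetical, not from the paper).

```python
# Route a segment to its particular model when one was trained,
# otherwise fall back to the general model of that segment type.
def classify_segment(swf, segment_type, segment_text, psm, gsm):
    """psm: dict mapping (segment_type, text) -> trained classifier;
       gsm: dict mapping segment_type -> trained classifier."""
    if segment_type in ("unigram", "bigram") and (segment_type, segment_text) in psm:
        model = psm[(segment_type, segment_text)]
    else:
        model = gsm[segment_type]       # whole-speech segments always land here
    return model.predict([swf])[0]      # predicted emotion label for the segment
```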

E. Emotion Sequence Classification

The emotions from every segment are concatenated to generate an observation vector $O$. For the system with three types of segments, the observation vector is generated as $O = [\, e_1^{(uni)} \cdots e_n^{(uni)} \;\; e_1^{(bi)} \cdots e_m^{(bi)} \;\; e^{(whole)} \,]$, where $e_j^{(T)}$ denotes the predicted emotion for the $j$-th speech segment of segment type $T$, and $n$ and $m$ are the numbers of unigram and bigram segments in the speech, respectively.

1) Emotion Profiles: In this work, we propose to modify the EP from the original one in [11] by using a histogram of the emotions in the observation vector as

$$EP(j) = \frac{1}{L} \sum_{i} w(i, j), \qquad (7)$$

$$w(i, j) = \begin{cases} 1, & O(i) = c_j \\ 0, & \text{otherwise,} \end{cases} \qquad (8)$$

where $c_j$ is the $j$-th emotion class and $L$ is the length of $O$. The EPs calculated from (7) and (8) are called unweighted emotion profiles (UEPs). The UEPs are generated without using statistical information from a language model, that is, the probability of each emotion for a given speech segment. Examples of conditional probabilities of emotion states for some speech segments are given in Table III. Each particular segment has its own emotion probabilities, which are estimated from the training data. For example, Table III shows that the word "supervisor" has probabilities of 0.27 and 0.73 for the neutral and anger emotions, respectively, and zero probability for the sadness and happiness emotions. To take the language model into consideration, the weighted emotion profiles (WEPs) are proposed. The WEPs are obtained by modifying the definition of the weight $w$ from the UEP definition in (8) to the new definition in (9), where $P(E \mid S)$ is the conditional probability of emotion $E$ for a given speech segment $S$ and $s_i$ denotes the transcription of the $i$-th speech segment:

$$w(i, j) = \begin{cases} P\bigl(E = O(i) \mid S = s_i\bigr), & O(i) = c_j \\ 0, & \text{otherwise.} \end{cases} \qquad (9)$$
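A minimal sketch of UEP and WEP extraction following (7)-(9). The variable names are hypothetical: `observations` is the emotion sequence $O$, `transcripts` gives the text of each segment, and `cond_prob[(text, emotion)]` holds $P(E \mid S)$ estimated from the training data (as in Table III).

```python
from collections import Counter

# Unweighted emotion profile, eqs. (7)-(8): normalized emotion histogram.
def unweighted_ep(observations, classes):
    counts = Counter(observations)
    L = len(observations)
    return [counts[c] / L for c in classes]

# Weighted emotion profile, eqs. (7) and (9): each observed emotion
# contributes its language-model probability P(E = O(i) | S = s_i).
def weighted_ep(observations, transcripts, cond_prob, classes):
    L = len(observations)
    ep = dict.fromkeys(classes, 0.0)
    for emotion, text in zip(observations, transcripts):
        ep[emotion] += cond_prob.get((text, emotion), 0.0)
    return [ep[c] / L for c in classes]
```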

2) Classification Method: In order to compare the performances, the EP features extracted from the emotion sequence are classified by using two classification schemes: an SVM and a voting scheme. A multi-class SVM is trained and used for classification with the EPs as feature vectors. The voting scheme, on the other hand, simply chooses the emotion with the highest value in the EP as the final result.
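A minimal sketch of the two sequence-level classifiers, assuming scikit-learn for the SVM route; the RBF kernel and the training-data names (`train_eps`, `train_labels`) are assumptions, not details given in the paper.

```python
from sklearn import svm

# Voting: pick the emotion class with the largest EP value.
def vote(ep, classes):
    return classes[max(range(len(classes)), key=lambda j: ep[j])]

# SVM route: treat the EP vector itself as the feature vector.
def train_ep_svm(train_eps, train_labels):
    clf = svm.SVC(kernel="rbf")   # kernel choice is an assumption
    clf.fit(train_eps, train_labels)
    return clf

# Usage: emotion = vote(ep, classes)  or  emotion = clf.predict([ep])[0]
```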

IV. EXPERIMENTAL RESULTS

In this work, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [13] provided by the SAIL lab at the University of Southern California is used. It contains videos, speech, facial motion captures, and text transcriptions collected from five male and five female speakers. The emotion of each utterance is evaluated by three evaluators. Only audio files whose emotion labels reach a majority vote are selected; the others are labeled as obscure data and excluded from these experiments. 10% of the audio files (722 samples) are randomly selected as test data, and the other 90% (6,810 samples) are used for training.

The experiments are performed on data sets of eight and four emotion states. The eight emotion states are neutral (N), frustration (Fr), anger (A), surprise (Sp), sadness (S), fear (F), happiness (H), and excited (E). The set of four emotion states is used for comparing the performance of the proposed method to the existing work in [14]. Based on the classes of emotion states defined in [14], the four emotion states N, A, S, and HE are used, where HE denotes the fusion class of the happiness and excited states. From the training data, 3,483 unigram and 21,983 bigram segments are used to generate PSMs as acoustic models and conditional probabilities as language models. Both stages of the proposed scheme are evaluated via recognition rates. In Section IV-A, the recognition rates of the proposed segment emotion classification are analyzed for every segment type and acoustic model. Then, the emotion sequence classification is tested on many combinations of descriptors, acoustic models, and classifiers in Section IV-B.
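A minimal sketch of the data-selection step described above (majority-vote filtering followed by a random 10% hold-out); the container `annotations`, the tie-breaking behavior, and the fixed random seed are assumptions for illustration.

```python
import random
from collections import Counter

# annotations: dict mapping utterance id -> list of three evaluator labels.
def prepare_dataset(annotations, test_ratio=0.1, seed=0):
    labeled = {}
    for utt, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count >= 2:                 # majority among 3 evaluators; else "obscure"
            labeled[utt] = label
    utts = sorted(labeled)
    random.Random(seed).shuffle(utts)
    n_test = int(round(test_ratio * len(utts)))
    test, train = utts[:n_test], utts[n_test:]
    return train, test, labeled
```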

A. Accuracy of Segment Emotion Classification

In this section, the recognition rates of the GSMs and PSMs for each type of segment are analyzed. The results are shown in Table IV. They show that the acoustic models trained on the four-emotion dataset tend to have higher recognition rates than those trained on the eight-emotion dataset.

Figure 4. Segment emotion classification scheme: unigram and bigram features are classified by their particular models when available and by the general models otherwise; whole-speech features are always classified by the general model.

TABLE III. EXAMPLE OF CONDITIONAL PROBABILITY OF EMOTIONS

  Segment         Neutral   Anger   Sadness   Happiness
  "love"          0.10      0.04    0.15      0.71
  "me"            0.15      0.35    0.19      0.31
  "supervisor"    0.27      0.73    0         0
  "just you"      0.29      0.14    0.43      0.14



Although PSMs have lower accuracy than GSMs for unigram segments, they have higher accuracy than GSMs for bigram segments. The results also show that better recognition rates are obtained as the lengths of the segments increase.

B. Accuracy of Emotion Sequence Classification

The recognition rates of the final results of the proposed method are investigated in this section. For the eight-emotion data, the method that provides the highest recognition rate of 58.59% is the combination of three types of segments, PSM, WEP, and the voting scheme. For the four-emotion data, the best combination, with a recognition rate of 71.25%, uses three types of segments, PSM, UEP, and the SVM classifier. Table V shows that the recognition rates can be improved by combining more types of segments. From the experimental results, it is clear that PSMs tend to provide better performance than GSMs and the conventional methods.

In this experiment, the proposed method is also compared with the previous work in [14] and an SVM baseline. The SVM baseline is a whole-speech classification without unigram or bigram segments, using the same acoustic feature set as the proposed method. The results in Table V show that our proposed method provides a higher recognition rate than both [14] and the SVM baseline.

V. CONCLUSIONS

A new approach for an ESR system is proposed. The proposed scheme divides speech waveforms into word components, groups the word components into segments, extracts features for every speech segment, classifies the speech segments separately, and combines the results from every segment into a final emotion state. The experimental results on the four-emotion dataset show that the proposed method provides a 15.37% improvement in recognition rate over the previous work in [14] (from 55.88% to 71.25%). In future work, not only language models but also further components, such as the confidence levels of the segment emotion classifiers, will be considered for determining the weights of the WEP.

ACKNOWLEDGMENT

We would like to express our appreciation to Prof. Shrikanth Narayanan and Ms. Angeliki Metallinou of the SAIL lab for giving us access to the IEMOCAP database.

REFERENCES

[1] A.S. Lampropoulos and G.A. Tsihrintzis, "Evaluation of MPEG-7 Descriptors for Speech Emotional Recognition," The 8th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp.98-101, 2012.

[2] E. Mower, M.J. Mataric, and S. Narayanan, "A Framework for Automatic Human Emotion Classification Using Emotion Profiles," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp.1057-1070, 2011.

[3] T. Pao, Y. Chen, and J. Yeh, "Comparison of Classification Methods for Detecting Emotion from Mandarin Speech," IEICE Transactions on Information and Systems, vol. 91, no. 4, pp.1074-1081, 2008.

[4] D. Neiberg, K. Elenius, and K. Laskowski, "Emotion Recognition in Spontaneous Speech using GMMs," Proc. Interspeech, vol. 6, 2006.

[5] P. Ekman, "Are there basic emotions," Psychological review, pp.550-553, 1992.

[6] R. Plutchik, "Emotion: A Psychoevolutionary Synthesis," New York, Harper & Row, 1980.

[7] W.G. Parrott, "Emotions in Social Psychology: Key Readings," Ann Arbor: Edward, 2001.

[8] T. Bänziger, V. Tran, and K.R. Scherer, "The Geneva Emotion Wheel: A tool for the verbal report of emotional reactions," International Society for Research on Emotion, 2005.

[9] M. Grimm, and K. Kroschel, "Emotion Estimation in Speech Using a 3D Emotion Space Concept," Robust Speech Recognition and Understanding, pp.281-300, 2007.

[10] J.A. Russell and A. Mehrabian, "Evidence for a Three-Factor Theory of Emotions," Journal of Research in Personality, vol. 11, no. 3, pp.273-294, 1977.

[11] E.M. Provost, and S. Narayanan, "Simplifying Emotion Classification Through Emotion Distillation," Signal & Information Processing Association Annual Summit and Conference, pp.1-4, 2012.

[12] H. Kim, N. Moreau, and T. Sikora, "MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval," Wiley, 2006.

[13] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, and S.S. Narayanan, "IEMOCAP: Interactive Emotional Dyadic Motion Capture Database," Language Resources and Evaluation, vol. 42, no. 4, pp.335-359, 2008.

[14] M. Li, A. Metallinou, D. Bone, and S. Narayanan, "Speaker States Recognition Using Latent Factor Analysis Based Eigenchannel Factor Vector Modeling," IEEE International Conference on Acoustics, Speech and Signal Processing, pp.1937-1940, 2012.

TABLE IV. ACCURACY OF SEGMENT EMOTION CLASSIFICATION

  Segment        Acoustic Model   8 Emotions   4 Emotions
  Unigram        GSM              39.82 %      46.45 %
                 PSM              39.42 %      40.20 %
  Bigram         GSM              40.59 %      48.32 %
                 PSM              45.00 %      54.22 %
  Whole speech   GSM              46.81 %      60.18 %

TABLE V. ACCURACY OF EMOTION SEQUENCE CLASSIFICATION

  Method       Segment Feature      Acoustic Model   EPs   Classifier   8 Emo.    4 Emo.
  M. Li [14]   -                    -                -     GMM          -         55.88 %
  Baseline     Whole speech         GSM              -     SVM          46.81 %   60.18 %
  Proposed     Unigram              GSM              WEP   SVM          45.15 %   57.66 %
                                                     WEP   Vote         44.18 %   52.61 %
                                                     UEP   SVM          39.34 %   50.08 %
                                                     UEP   Vote         40.72 %   49.76 %
                                    PSM              WEP   SVM          54.57 %   57.66 %
                                                     WEP   Vote         55.96 %   54.82 %
                                                     UEP   SVM          52.49 %   54.82 %
                                                     UEP   Vote         52.35 %   47.24 %
               Unigram + Bigram     GSM              WEP   SVM          46.54 %   60.98 %
                                                     WEP   Vote         44.88 %   57.35 %
                                                     UEP   SVM          37.53 %   50.40 %
                                                     UEP   Vote         40.86 %   52.44 %
                                    PSM              WEP   SVM          57.20 %   66.19 %
                                                     WEP   Vote         57.76 %   63.03 %
                                                     UEP   SVM          56.65 %   67.93 %
                                                     UEP   Vote         56.65 %   60.19 %
               Unigram + Bigram     GSM              WEP   SVM          48.75 %   58.29 %
               + Whole speech                        WEP   Vote         45.29 %   57.82 %
                                                     UEP   SVM          43.21 %   53.71 %
                                                     UEP   Vote         41.41 %   53.87 %
                                    PSM              WEP   SVM          56.78 %   70.46 %
                                                     WEP   Vote         58.59 %   65.40 %
                                                     UEP   SVM          57.89 %   71.25 %
                                                     UEP   Vote         56.51 %   63.03 %
