
Speaker-Dependent Audio-Visual Emotion Recognition

Sanaul Haq and Philip J.B. Jackson

Centre for Vision, Speech and Signal Processing, University of Surrey, UK

Overview

• Introduction
• Method
• Audio-visual experiments
• Summary


Introduction


In human-to-human interaction, emotions play an important role, allowing people to express themselves beyond the verbal domain.

A message carries two kinds of information:
• Linguistic information: the textual content of the message.
• Paralinguistic information: body language, facial expressions, pitch and tone of voice, health and identity. [Busso et al., 2007]

Prior work:
• Audio-based emotion recognition: Borchert et al., 2005; Lin and Wei, 2005; Luengo et al., 2005.
• Visual-based emotion recognition: Ashraf et al., 2007; Bartlett et al., 2005; Valstar et al., 2007.
• Audio-visual emotion recognition: Busso et al., 2004; Schuller et al., 2007; Song et al., 2004.

Method

• Recording of an audio-visual database
• Feature extraction
• Feature selection
• Feature reduction
• Classification


Audio-visual database

• Four English male actors recorded in 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise.
• 15 phonetically-balanced TIMIT sentences per emotion: 3 common, 2 emotion-specific and 10 generic sentences.
• Database size: 480 sentences in total, 120 utterances per subject.
• For visual features, the face was painted with 60 markers.
• Captured with the 3dMD dynamic face capture system [3dMD capture system, 2009].


Figure: Subjects with expressions (from left): Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).

Feature extraction

Audio features:
• Pitch: min., max., mean and standard deviation of the mel-frequency pitch and of its first-order difference.
• Energy: min., max., mean, standard deviation and range of the energy in the speech signal and in the frequency bands 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-4 kHz and 4-8 kHz.
• Duration: voiced speech, unvoiced speech and sentence durations; voiced-to-unvoiced duration ratio; speech rate; voiced-speech-to-sentence duration ratio.
• Spectral: mean and standard deviation of 12 MFCCs.

Visual features:
• Mean and standard deviation of the 2D facial marker coordinates (60 markers).


Figure: Audio feature extraction with Speech Filing System software (left), and video data (right) with overlaid tracked marker locations.
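To make the audio statistics concrete, here is a minimal sketch of utterance-level feature computation. It is an illustration only: the authors used the Speech Filing System shown above, whereas this sketch assumes the librosa library, a 60-400 Hz pitch range and 16 kHz audio, and it omits the band-energy and duration features for brevity.

```python
# Sketch of utterance-level audio statistics (illustrative; not the
# authors' Speech Filing System pipeline). Assumes librosa is installed.
import numpy as np
import librosa

def utterance_audio_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)

    # Pitch contour (F0) with voicing decisions; NaN where unvoiced.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0_mel = librosa.hz_to_mel(f0)      # pitch on the mel scale
    df0 = np.diff(f0_mel)               # first-order difference

    # Frame-level energy (RMS) of the full-band signal.
    rms = librosa.feature.rms(y=y)[0]

    # 12 MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)

    def stats(x):                       # min, max, mean, std over frames
        x = x[~np.isnan(x)]
        return [x.min(), x.max(), x.mean(), x.std()]

    return np.array(
        stats(f0_mel) + stats(df0)                       # pitch features
        + stats(rms) + [rms.max() - rms.min()]           # energy + range
        + list(mfcc.mean(axis=1)) + list(mfcc.std(axis=1))  # spectral
    )
```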



Feature selection

Plus l-Take Away r algorithm [Chen, 1978]:
• Combines the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms.
• Selection criterion: Bhattacharyya distance.
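A sketch of how this selection loop can be implemented (an illustration, not the authors' code). The criterion below is the standard two-class Bhattacharyya distance under Gaussian assumptions; for the 7-class task it would need to be averaged over class pairs.

```python
# Plus l-Take Away r feature selection (sketch). `criterion` scores a
# candidate feature subset; higher is better.
import numpy as np

def bhattacharyya(X, y):
    """Two-class Bhattacharyya distance, assuming Gaussian classes 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    m = X0.mean(axis=0) - X1.mean(axis=0)
    S0, S1 = np.atleast_2d(np.cov(X0.T)), np.atleast_2d(np.cov(X1.T))
    S = (S0 + S1) / 2
    return (m @ np.linalg.solve(S, m) / 8
            + 0.5 * np.log(np.linalg.det(S)
                           / np.sqrt(np.linalg.det(S0) * np.linalg.det(S1))))

def plus_l_take_away_r(X, y, criterion, k, l=2, r=1):
    """Select k feature indices: l SFS additions, then r SBS removals."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        for _ in range(min(l, len(remaining))):    # Plus l (forward steps)
            best = max(remaining,
                       key=lambda f: criterion(X[:, selected + [f]], y))
            selected.append(best)
            remaining.remove(best)
        for _ in range(r):                         # Take away r (backward)
            if len(selected) <= 1:
                break
            # Remove the feature whose absence leaves the best subset.
            worst = max(selected,
                        key=lambda f: criterion(
                            X[:, [s for s in selected if s != f]], y))
            selected.remove(worst)
            remaining.append(worst)
    return selected[:k]
```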


Feature reduction

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)

Figure: Illustration of PCA and LDA projections in terms of within-class and between-class distance.
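For concreteness, a brief sketch of both reductions using scikit-learn (an assumed toolkit; the slides do not say how PCA and LDA were implemented). LDA can retain at most C - 1 components for C classes, which matches the 6 LDA features used for the 7-class results below.

```python
# PCA vs. LDA on a synthetic stand-in for the selected feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(480, 40))      # e.g. 480 utterances x 40 features
y = rng.integers(0, 7, size=480)    # 7 emotion labels

# PCA: unsupervised; keeps the directions of maximum total variance.
X_pca = PCA(n_components=10).fit_transform(X)

# LDA: supervised; keeps directions maximizing between-class scatter
# relative to within-class scatter (at most 7 - 1 = 6 components).
X_lda = LinearDiscriminantAnalysis(n_components=6).fit(X, y).transform(X)
```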

Classification

Gaussian classifier: Bayes decision theory is used, with the class-conditional probability density p(x | ω_i) assumed to be Gaussian for each class ω_i. The Bayes decision rule is:

  decide ω_i   if   P(ω_i | x) > P(ω_j | x) for all j ≠ i,


where P(ω_i | x) ∝ p(x | ω_i) P(ω_i) is the posterior probability of class ω_i given feature vector x, and P(ω_i) is the prior class probability.
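A minimal from-scratch sketch of such a classifier (illustrative; not the authors' implementation): estimate a mean, covariance and prior per class, then choose the class with the largest log-posterior.

```python
# Gaussian (Bayes) classifier: one Gaussian density per emotion class.
import numpy as np
from scipy.stats import multivariate_normal

class GaussianClassifier:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = [
            (X[y == c].mean(axis=0),        # class mean
             np.cov(X[y == c].T),           # class covariance
             np.mean(y == c))               # class prior P(w_i)
            for c in self.classes_
        ]
        return self

    def predict(self, X):
        # log P(w_i | x) = log p(x | w_i) + log P(w_i) + const.
        scores = np.column_stack([
            multivariate_normal.logpdf(X, mean=m, cov=S, allow_singular=True)
            + np.log(p)
            for m, S, p in self.params_
        ])
        return self.classes_[scores.argmax(axis=1)]
```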

Human evaluation experiments

• Human recognition results provide a useful benchmark for the audio, visual and audio-visual scenarios.
• 10 evaluators: 5 native speakers and 5 non-native; half of the evaluators were female.
• 10 different sets of the data were created using the Balanced Latin Square method [Edwards, 1962] (see the sketch after this list).
• Training material: three facial expression pictures, two audio files and a short movie clip for each emotion.
• Emotion groups: Displeased (anger, disgust), Gloomy (fear, sadness), Excited (happiness, surprise) and Neutral (neutral).
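For illustration, the standard construction of a balanced Latin square for an even number of conditions n, assumed here to be in the spirit of the Edwards (1962) design: every condition appears once per row and column, and each condition immediately precedes every other equally often.

```python
# Balanced Latin square (sketch): rows are presentation orders.
def balanced_latin_square(n):
    assert n % 2 == 0, "the classic construction requires an even n"
    # First row: 0, 1, n-1, 2, n-2, ... interleaving low and high.
    first, lo, hi = [0], 1, n - 1
    while len(first) < n:
        first.append(lo); lo += 1
        if len(first) < n:
            first.append(hi); hi -= 1
    # Each later row shifts every condition by the row index (mod n).
    return [[(c + i) % n for c in first] for i in range(n)]

orders = balanced_latin_square(10)   # e.g. 10 orders for the 10 data sets
```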


Experiments

• Three conditions: audio, visual and audio-visual.
• The audio-visual results use decision-level fusion (sketched below), which:
  – overcomes the problem of different time scales and metric levels of the audio and visual data;
  – avoids the high dimensionality of the concatenated feature vector that arises with feature-level fusion [Haq, Jackson & Edge, 2008].
• Related audio-visual work: Busso et al., 2004; Wang et al., 2005; Zeng et al., 2007; Pal et al., 2006.
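A minimal sketch of the decision-level fusion step, under an assumed combination rule (a weighted sum of per-modality log-posteriors; the slides do not spell out the exact rule used):

```python
# Late fusion: combine audio and visual classifier outputs per utterance.
import numpy as np

def fuse_decisions(log_post_audio, log_post_visual, w_audio=0.5):
    """log_post_*: (n_samples, n_classes) log-posteriors per modality."""
    fused = w_audio * log_post_audio + (1.0 - w_audio) * log_post_visual
    return fused.argmax(axis=1)      # index of the winning emotion class
```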


Results

Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 7 emotion classes.


Results

Figure: Average recognition rate (%) over 4 actors with audio, visual, and audio-visual features for 4 emotion classes.


Results

Table: Recognition rates (%) per actor (KL, JE, JK, DC) for 7 emotion classes: human evaluators vs. the Gaussian classifier with 6 LDA features and with 10 PCA features.

Human         KL    JE    JK    DC    Mean (±CI)
Audio         53.2  67.7  71.2  73.7  66.5 ± 2.5
Visual        89.0  89.8  88.6  84.7  88.0 ± 0.6
Audio-visual  92.1  92.1  91.3  91.7  91.8 ± 0.1

LDA 6         KL    JE    JK    DC    Mean (±CI)
Audio         35.0  62.5  56.7  70.8  56.3 ± 6.7
Visual        96.7  92.5  92.5  100   95.4 ± 1.6
Audio-visual  97.5  96.7  95.8  100   97.5 ± 0.8

PCA 10        KL    JE    JK    DC    Mean (±CI)
Audio         30.8  51.7  55.0  62.5  50.0 ± 6.0
Visual        91.7  87.5  89.2  98.3  91.7 ± 2.1
Audio-visual  92.5  86.7  94.2  98.3  92.9 ± 2.1


Results

Table: Recognition rates (%) per actor (KL, JE, JK, DC) for 4 emotion classes: human evaluators vs. the Gaussian classifier with 7 PCA features and with 3 LDA features.

Human         KL    JE    JK    DC    Mean (±CI)
Audio         63.2  80.9  79.2  82.0  76.3 ± 2.4
Visual        90.6  97.2  90.0  87.4  91.3 ± 1.1
Audio-visual  94.4  98.3  93.5  94.5  95.2 ± 0.6

PCA 7         KL    JE    JK    DC    Mean (±CI)
Audio         50.0  58.3  65.8  72.5  61.7 ± 4.3
Visual        80.8  98.3  93.3  88.3  90.2 ± 3.3
Audio-visual  83.3  97.5  95.0  93.3  92.3 ± 2.7

LDA 3         KL    JE    JK    DC    Mean (±CI)
Audio         60.0  75.8  58.3  80.0  68.5 ± 4.8
Visual        97.5  100   94.2  100   97.9 ± 1.2
Audio-visual  97.5  100   94.2  100   97.9 ± 1.2


Summary

Conclusions:
• On the British English emotional database, the system achieved a recognition rate comparable to human performance (speaker-dependent).
• LDA outperformed PCA when both used the top 40 selected features.
• Both audio and visual information were useful for emotion recognition.
• Most useful audio features: energy and MFCCs.
• Most useful visual features: means of the marker y-coordinates (vertical movement of the face).
• One reason for the system's higher performance: the human evaluators were not given an equal opportunity to exploit the data.

Future work:
• Speaker-independent experiments, using more data and other classifiers, such as GMM and SVM.


References

Ashraf, A.B., Lucey, S., Cohn, J.F., Chen, T., Ambadar, Z., Prkachin, K., Solomon, P. & Theobald, B.J. (2007). The Painful Face: Pain Expression Recognition Using Active Appearance Models, Proc. Ninth ACM Int'l Conf. Multimodal Interfaces, pp. 9-14.

Bartlett, M.S., Littlewort, G., Frank, M. et al. (2005). Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior, Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition , pp. 568-573.

Borchert, M. & Düsterhöft, A. (2005). Emotions in Speech – Experiments with Prosody and Quality Features in Speech for Use in Categorical and Dimensional Emotion Recognition Environments, Proc. NLPKE, pp. 147–151.

Busso, C. & Narayanan, S.S. (2007). Interrelation between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study. IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331-2347.

Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. & Narayanan, S. (2004). Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proc. ACM Int'l Conf. Multimodal Interfaces , pp. 205-211.

Chen, C. (1978). Pattern Recognition and Signal Processing. Sijthoff & Noordoff, The Netherlands.

Edwards, A. L. (1962). Experimental Design in Psychological Research. New York: Holt, Rinehart and Winston.

3dMD 4D Capture System. Online: http://www.3dmd.com, accessed on 3 May, 2009.


Haq, S., Jackson, P.J.B. & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification, Proc. Auditory-Visual Speech Processing, pp. 185-190.

Lin, Y. & Wei, G. (2005). Speech Emotion Recognition Based on HMM and SVM, Proc. 4th Int’l Conf. on Mach. Learn. and Cybernetics, pp. 4898-4901.

Luengo, I., Navas, E. et al. (2005). Automatic Emotion Recognition using Prosodic Parameters, Proc. Interspeech, pp. 493-496.

Pal, P., Iyer, A.N. & Yantorno, R.E. (2006). Emotion Detection from Infant Facial Expressions and Cries, Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, vol. 2, pp.721-724.

Schuller, B., Muller, R., Hornler, B. et al. (2007). Audiovisual Recognition of Spontaneous Interest within Conversations, Proc. ACM Int'l Conf. Multimodal Interfaces , pp. 30-37.

Song, M., Bu, J., Chen, C. & Li, N. (2004). Audio-Visual-Based Emotion Recognition: A New Approach, Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 1020-1025.

Valstar, M.F., Gunes, H. & Pantic, M. (2007). How to Distinguish Posed from Spontaneous Smiles Using Geometric Features, Proc. ACM Int’l Conf. Multimodal Interfaces, pp. 38-45.

Wang, Y. & Guan, L. (2005). Recognizing Human Emotion from Audiovisual Information, Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 1125-1128.

Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D. & Levinson, S. (2007). Audio-Visual Affect Recognition. IEEE Trans. Multimedia, vol. 9, no. 2, pp. 424-428.


Thank you for your attention!
