Beyond Basic Emotions: Expressive Virtual Actors with Social Attitudes

Adela Barbulescu, Remi Ronfard, Gerard Bailly, Georges Gagnere and Huseyin Cakmak

Purpose of study

Expressive speech animation
Talking heads
Complex mental states

How to encode mental states of a talking character?

Basic emotions

Paul Ekman, An argument for basic emotions, 1992

Emotions in speech animation

Bregler et al., Mood swings: expressive speech animation, 2005
Neutral, Happy and Angry

Busso et al., Rigid Head Motion in Expressive Speech Animation: Analysis and Synthesis, 2007
Neutral, Angry, Happy and Sad

Albrecht et al., Mixed feelings: Expression of non-basic emotions in a muscle-based talking head, 2005
24 categories (gloating, relief, pride, reproach, etc.)

[Bregler et al, 2005]

Motivation

Virtual actors performing an expressive dialogue

Theoretical terms

Prosody = the style of speech (intonation)
Graf et al., Visual prosody: facial movements accompanying speech, 2002
Levine et al., Real-time prosody-driven synthesis of body language, 2009

Social attitude (e.g. comforting, ironic, doubtful)
Bolinger, Intonation and its uses: Melody in grammar and discourse, 1989

« How we feel when we say (emotions) and how we feel about what we say (attitudes) »
[Bolinger, 1989]

Our solution

Study on prosodic features: voice pitch, speech rhythm, head movements
Discrete set of social attitudes
Approaches: distance metric, perceptual evaluation tests

Theoretical framework

Prosodic features present attitude-specific signatures depending on the length of sentences [Morlec, 2001]

F0 and head motion encoded by contours at sentence level: 1, 2, 5, 9, 11 and 13 syllables

Example clips: Declarative, Question, Disbelieving

Expressive corpus

Faceshift
1 director + 2 actors
35 identical phrases
13 attitudes from Mind Reading [Baron-Cohen, 2004] + Declarative, Interrogative, Exclamative

Attitudes in corpus

Declarative, Interrogative, Fond-liking, Comforting, Seductive, Fascinated

Jealous, Thinking, Disbelieving, Sarcastic, Scandalized, Dazed

Prosodic representation

Voice pitch: F0 extraction
Head movements: extracted using Faceshift
Rhythm: duration factor using elastic syllable model
Used in computing a distance metric

Segmentation and annotation with Praat


3 values / syllable (at 10, 50 and 80% of vocalic nucleus)

- 1 value / syllable
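The per-syllable sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of `None` for unvoiced frames, linear interpolation between F0 frames, and the semitone normalization are all assumptions made for the example. Only the three sampling points (10, 50 and 80% of the vocalic nucleus) come from the slide.

```python
from bisect import bisect_left
from math import log2

def interpolate(times, values, t):
    """Linear interpolation of a sampled track at time t (clamped at the ends)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    k = bisect_left(times, t)
    t0, t1 = times[k - 1], times[k]
    w = (t - t0) / (t1 - t0)
    return (1 - w) * values[k - 1] + w * values[k]

def sample_f0_per_syllable(times, f0, nuclei, fractions=(0.10, 0.50, 0.80)):
    """Return one list of F0 samples per syllable.

    times, f0 : F0 track (seconds, Hz); unvoiced frames are None (assumption).
    nuclei    : (start, end) times of each vocalic nucleus, e.g. from a Praat tier.
    """
    voiced = [(t, f) for t, f in zip(times, f0) if f is not None]
    vt = [t for t, _ in voiced]
    vf = [f for _, f in voiced]
    return [[interpolate(vt, vf, start + p * (end - start)) for p in fractions]
            for start, end in nuclei]

def to_semitones(f0_hz, ref_hz):
    """Hypothetical normalization: semitones relative to a speaker reference."""
    return 12.0 * log2(f0_hz / ref_hz)

# Toy example: a linearly rising F0 track and two vocalic nuclei.
times = [i / 100 for i in range(101)]
f0 = [100.0 + 100.0 * t for t in times]
nuclei = [(0.0, 0.5), (0.5, 1.0)]
contour = sample_f0_per_syllable(times, f0, nuclei)
```

Each row of `contour` holds the three F0 values of one syllable, so a sentence of N syllables yields a fixed-size 3 x N pitch contour.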

Evaluation paradigm

Distance metric (data analysis)
Inter-class distances based on the prosodic features

Perceptual evaluation
Create 3 types of material
Carry out perceptual tests for each type of material

Results comparison
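As an illustration of inter-class distances, the sketch below averages the Euclidean distance over all pairs of utterances drawn from two attitude classes. The slides do not say how the inter-class distance is aggregated, so the averaging, the function name and the toy feature vectors are assumptions.

```python
from itertools import product
from math import dist

def interclass_distances(features, labels):
    """Mean Euclidean distance between every pair of attitude classes.

    features : one prosodic feature vector per utterance (equal lengths).
    labels   : attitude label of each utterance.
    """
    classes = sorted(set(labels))
    groups = {c: [f for f, l in zip(features, labels) if l == c] for c in classes}
    table = {}
    for a, b in product(classes, repeat=2):
        # All pairwise distances between members of class a and class b.
        pairs = [dist(x, y) for x in groups[a] for y in groups[b]]
        table[(a, b)] = sum(pairs) / len(pairs)
    return table

# Toy data: two utterances per attitude, two features each.
features = [(0.0, 0.0), (0.0, 2.0), (3.0, 0.0), (3.0, 2.0)]
labels = ["declarative", "declarative", "question", "question"]
table = interclass_distances(features, labels)
```

An entry such as `table[("declarative", "question")]` that is large compared with the within-class entries suggests the two attitudes are well separated by the chosen features.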

Objective evaluation

Euclidean distances for equal-sized sentences
Normalized F0
PCA components of rotation and translation

K-nearest neighbor framework
Results to be compared with those of perceptual tests
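A k-nearest-neighbor evaluation with Euclidean distance might look like the sketch below. The leave-one-out protocol, the value of k and the toy data are assumptions; the slides name only the distance and the k-NN framework. The PCA step for head motion is omitted here.

```python
from collections import Counter
from math import dist

def knn_leave_one_out(features, labels, k=1):
    """Leave-one-out k-NN accuracy using Euclidean distance."""
    correct = 0
    for i, x in enumerate(features):
        # Distances to every other utterance; the held-out one cannot vote.
        others = [(dist(x, y), labels[j]) for j, y in enumerate(features) if j != i]
        others.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in others[:k])
        correct += votes.most_common(1)[0][0] == labels[i]
    return correct / len(features)

# Toy data: two well-separated attitudes, three utterances each.
features = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2),
            (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
labels = ["comforting"] * 3 + ["sarcastic"] * 3
accuracy = knn_leave_one_out(features, labels, k=1)
```

The resulting accuracy plays the role of the objective recognition rate that is compared with the rates obtained in the perceptual tests.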

[Figure: results over the numbered attitudes: 1 declarative, 2 exclamative, 3 interrogative, 4 comforting, 5 fond-liking, 6 seductive, 7 fascinated, 8 jealous, 9 thinking, 10 incredulous, 11 sarcastic, 12 scandalised, 13 dazed, 14 responsible, 15 hurt, 16 embarrassed]

Audio-only: 30.60 %
Visual-only: 24.14 %
Audio-visual: 40.95 %

Material 1

Original audio and original video
Example clips: Sarcastic, Fascinated, Fond-liking

Material 2

Original audio and motion capture
Animation platform

Material 3

Resynthesis of prosody
Head motion, pitch and rhythm from expressive performance
Add other parameters from neutral performance

Material 3

Audio resynthesis
TD-PSOLA (Time-Domain Pitch-Synchronous Overlap and Add)
Move, delete or duplicate short-time signals

[Figure: analysed speech and synthesized speech]
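The idea of moving, deleting or duplicating short-time signals can be illustrated with a heavily simplified TD-PSOLA sketch. It assumes that pitch marks (sample indices, at least two) are already available, uses two-period Hann-windowed grains, and ignores unvoiced segments. The function names and parameters are hypothetical and this is not the resynthesis tool used in the study.

```python
from math import cos, pi, sin

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * cos(2 * pi * k / (n - 1)) for k in range(n)]

def td_psola(signal, marks, pitch_scale=1.0, time_scale=1.0):
    """Re-space pitch-synchronous grains to change pitch and/or duration."""
    out_len = round(len(signal) * time_scale)
    out = [0.0] * out_len
    t = marks[0] * time_scale                      # first synthesis instant
    while t < out_len:
        # Analysis mark closest to the synthesis instant mapped back in time;
        # marks get reused (duplicated) or skipped (deleted) as needed.
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - t / time_scale))
        if i + 1 < len(marks):
            period = marks[i + 1] - marks[i]
        else:
            period = marks[i] - marks[i - 1]
        lo = marks[i] - period                     # grain start in the input
        start = round(t) - period                  # grain start in the output
        fits_in = lo >= 0 and marks[i] + period <= len(signal)
        fits_out = start >= 0 and start + 2 * period <= out_len
        if fits_in and fits_out:
            window = hann(2 * period)
            for k in range(2 * period):
                out[start + k] += signal[lo + k] * window[k]   # overlap-add
        t += period / pitch_scale                  # closer grains -> higher pitch
    return out

# Toy example: a 100 Hz sine at 8 kHz, one pitch mark per period (80 samples).
fs = 8000
signal = [sin(2 * pi * 100 * n / fs) for n in range(2000)]
marks = list(range(80, 1920, 80))
shifted = td_psola(signal, marks, pitch_scale=1.5)
```

With `pitch_scale=1.5` the grains are placed about 53 samples apart instead of 80, so the output repeats at roughly 150 Hz while keeping the original duration. Setting `time_scale` instead changes duration by duplicating or skipping grains.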

Material 3

Visual resynthesis
DTW (Dynamic Time Warping) from neutral to expressive
Cubic spline interpolation for translations
Quaternion interpolation for rotations

[Figure: Rotation (quaternion x) and Translation x]
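The two building blocks named for the rotation track, dynamic time warping and quaternion interpolation, can be sketched as follows. This is a generic textbook formulation, not the authors' implementation. The frame representation, the (w, x, y, z) quaternion order and the function names are assumptions, and the cubic spline step for translations is not shown.

```python
from math import acos, cos, dist, inf, pi, sin, sqrt

def dtw_path(a, b):
    """Dynamic time warping; returns aligned index pairs (i, j) for frames a[i], b[j]."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the last cell to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves, key=lambda mv: mv[0])
    return path[::-1]

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    dot = sum(x * y for x, y in zip(q0, q1))
    if dot < 0.0:                       # take the shorter arc
        q1, dot = [-y for y in q1], -dot
    if dot > 0.9995:                    # nearly parallel: linear blend is safe
        q = [x + t * (y - x) for x, y in zip(q0, q1)]
    else:
        theta = acos(dot)
        w0 = sin((1 - t) * theta) / sin(theta)
        w1 = sin(t * theta) / sin(theta)
        q = [w0 * x + w1 * y for x, y in zip(q0, q1)]
    norm = sqrt(sum(x * x for x in q))
    return [x / norm for x in q]

# Toy example: the expressive track holds its first frame one step longer.
neutral = [(0.0,), (1.0,), (2.0,), (3.0,)]
expressive = [(0.0,), (0.0,), (1.0,), (2.0,), (3.0,)]
path = dtw_path(neutral, expressive)

# Halfway between the identity and a 90 degree rotation about the x axis.
q_a = [1.0, 0.0, 0.0, 0.0]
q_b = [cos(pi / 4), sin(pi / 4), 0.0, 0.0]
half = slerp(q_a, q_b, 0.5)
```

The warping path gives, for each expressive frame, the neutral frame it corresponds to. `slerp` can then fill in rotations at time instants that fall between two captured frames.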

Auto-evaluation

Material 1 (Original audio and video)
3 participants
Best results: female actor

[Figure: results over the numbered attitudes, 1 declarative to 16 embarrassed]

Audio-only: 78.12 %
Visual-only: 81.25 %
Audio-visual: 78.12 %

User study: Material 1

Material 1 (Original audio and video)
84 participants

[Figure: results over the numbered attitudes, 1 declarative to 16 embarrassed]

Audio-only: 30.98 %
Visual-only: 35.47 %
Audio-visual: 36.90 %


User study: Material 2

Material 2 (Original audio and motion capture)
42 participants

[Figure: results over the numbered attitudes, 1 declarative to 15 embarrassed]

Audio-only: 26.00 %
Visual-only: 17.73 %
Audio-visual: 31.72 %


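User study: Material 3

Material 3 (Resynthesis of prosody)
13 participants

[Figure: results over the numbered attitudes; recoverable labels: 1 declarative, 2 responsible, 3 embarrassed, incredulous, 11 sneaky, 12 scandalised, 13 dazed]

Audio-only: 15.58 %
Visual-only: 11.65 %
Audio-visual: 16.96 %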

Material 1: Results per attitude

Declarative Interrogative Fond-liking Comforting Seductive

Fascinated Jealous Thinking Disbelieving Sarcastic Scandalized

Dazed Embarrassed Responsible

Exclamative

Hurt

0.170 0.066 0.788 0.290 0.368 0.761

0.391 0.167 0.814 0.214 0.640 0.264

0.357 0.273 0.150 0.617


Material 2: Results per attitude

Declarative Interrogative Fond-liking Comforting

Seductive Fascinated Jealous Thinking Disbelieving

Sarcastic Scandalized Dazed Embarrassed Responsible

Exclamative

0.235 0.128 0.602 0.253 0.372

0.550 0.286 0.227 0.424 0.181

0.313 0.554 0.219 0.040 0.540


Material 3: Results per attitude

Declarative Fond-liking Comforting Seductive Fascinated

Jealous Thinking Disbelieving Sarcastic

Scandalized Dazed Embarrassed Responsible

0.124 0.105 0.143 0.091 0.231

0.067 0.379 0.135 0.222

0.333 0.200 0.000 0.192

Average results and conclusions

Test type         Audio-only   Visual-only   Audio-visual
Auto-evaluation   62.50%       73.96%        73.96%
Material 1        30.98%       35.47%        36.90%
Material 2        26.00%       17.73%        31.72%
Material 3        15.58%       11.65%        16.96%
Objective         30.60%       24.14%        40.95%

Generally better results for Audio-visual
Rates for Visual-only decrease as animation is used
Results of objective and subjective tests are comparable for Audio-only
F0 and speech rhythm present discriminant signatures
Head movement is not sufficient

Future work

Improve retargeting (Material 2)
Improve resynthesis (Material 3)

Blend prosodic and non-prosodic features
Learn prosodic signatures

Investigate other prosodic features
Audio: intensity
Video: eyebrow movements, eye gaze, eye blink

Generate expressive dialogue


Thank you!
