Beyond Basic Emotions: Expressive Virtual Actors with Social Attitudes Adela Barbulescu, Remi Ronfard, Gerard Bailly, Georges Gagnere and Huseyin Cakmak

Post on 13-Feb-2021

TRANSCRIPT

  • Beyond Basic Emotions: Expressive Virtual Actors with Social Attitudes

    Adela Barbulescu, Remi Ronfard, Gerard Bailly, Georges Gagnere and Huseyin Cakmak

  • Purpose of study

    Expressive speech animation, talking heads, complex mental states

    How to encode mental states of a talking character?

  • Basic emotions

    Paul Ekman, An argument for basic emotions, 1992

  • Emotions in speech animation

    Bregler et al., Mood swings: expressive speech animation, 2005: Neutral, Happy and Angry

    Busso et al., Rigid Head Motion in Expressive Speech Animation: Analysis and Synthesis, 2007: Neutral, Angry, Happy and Sad

    Albrecht et al., Mixed feelings: Expression of non-basic emotions in a muscle-based talking head, 2005: 24 categories (gloating, relief, pride, reproach, etc.)

    [Bregler et al., 2005]

  • Motivation

    Virtual actors performing an expressive dialogue

    file:///C:/work/Desktop/sub.avi

  • Theoretical terms

    Prosody = the style of speech (intonation)

    Graf et al., Visual prosody: facial movements accompanying speech, 2002

    Levine et al., Real-time prosody-driven synthesis of body language, 2009

    Social attitude (e.g. comforting, ironic, doubtful)

    Bolinger et al., Intonation and its uses: Melody in grammar and discourse, 1989

    « How we feel when we say (emotions) and how we feel about what we say (attitudes) » [Bolinger, 1989]

  • Our solution

    Study on prosodic features: voice pitch, speech rhythm, head movements

    Discrete set of social attitudes

    Approaches: distance metric, perceptual evaluation tests

  • Theoretical framework

    Prosodic features present attitude-specific signatures depending on the length of sentences [Morlec, 2001]

    F0 and head motion encoded by contours at sentence level: 1, 2, 5, 9, 11 and 13 syllables

    Declarative, Question, Disbelieving

    file:///C:/work/Dropbox/vids/cnt_1.avi
    file:///C:/work/Dropbox/vids/cnt_3.avi
    file:///C:/work/Dropbox/vids/cnt_10.avi

  • Expressive corpus

    Faceshift

    1 director + 2 actors

    35 identical phrases

    13 attitudes from Mind Reading [Baron-Cohen, 2004] + Declarative, Interrogative, Exclamative

  • Attitudes in corpus

    Declarative, Interrogative, Fond-liking, Comforting, Seductive, Fascinated

    Jealous, Thinking, Disbelieving, Sarcastic, Scandalized, Dazed

  • Prosodic representation

    Voice pitch: F0 extraction

    Head movements: extracted using Faceshift

    Rhythm: duration factor using elastic syllable model

    Used in computing a distance metric

    Segmentation and annotation with Praat

    3 values / syllable (at 10, 50 and 80% of vocalic nucleus)

    1 value / syllable
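The sampling of three values per syllable could be sketched as follows. This is only an illustration, assuming the sampled feature is an F0 track given as time/value pairs and that the vocalic nucleus boundaries come from the Praat annotation; the function names, the linear interpolation and the data layout are assumptions, not details taken from the slides.

```python
from bisect import bisect_left


def interpolate(times, values, t):
    """Linearly interpolate a sampled track at time t (clamped at the ends)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_left(times, t)
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def sample_f0_per_syllable(times, f0, nuclei, positions=(0.10, 0.50, 0.80)):
    """Return three F0 values per syllable.

    times, f0 : the F0 track (hypothetical layout: parallel lists)
    nuclei    : list of (start, end) times of each vocalic nucleus
    positions : relative positions inside the nucleus (10, 50 and 80%)
    """
    contour = []
    for start, end in nuclei:
        contour.append(
            [interpolate(times, f0, start + p * (end - start)) for p in positions]
        )
    return contour


if __name__ == "__main__":
    # Toy track: F0 rising linearly from 100 Hz to 200 Hz over one second.
    track_times = [0.0, 0.5, 1.0]
    track_f0 = [100.0, 150.0, 200.0]
    print(sample_f0_per_syllable(track_times, track_f0, [(0.0, 0.4), (0.6, 1.0)]))
```

Sampling at fixed relative positions gives every syllable the same number of values, so sentences with the same number of syllables yield vectors of equal length.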

  • Evaluation paradigm

    Distance metric (data analysis): inter-class distances based on the prosodic features

    Perceptual evaluation: create 3 types of material, carry out perceptual tests for each type of material

    Results comparison

  • Objective evaluation

    Euclidean distances for equal-sized sentences

    Normalized F0

    PCA components of rotation and translation

    K-nearest neighbor framework

    Results to be compared with those of perceptual tests

    [Figure: panels for Audio-only, Visual-only and Audio-visual; attitude labels: 1 declarative, 2 exclamative, 3 interrogation, 4 comforting, 5 fond-liking, 6 seductive, 7 fascinated, 8 jealous, 9 thinking, 10 incredulous, 11 sarcastic, 12 scandalised, 13 dazed, 14 responsible, 15 hurt, 16 embarrassed]

    Audio-only: 30.60 %, Visual-only: 24.14 %, Audio-visual: 40.95 %
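A K-nearest-neighbor classification with Euclidean distances can be sketched as below. It is a minimal illustration, not the authors' pipeline: the feature vectors are toy values, the leave-one-out protocol and the choice of k are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query.

    train : list of (feature_vector, attitude_label) pairs
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_rate(samples, k=3):
    """Fraction of samples whose label is recovered from all the others."""
    hits = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, vector, k) == label:
            hits += 1
    return hits / len(samples)


if __name__ == "__main__":
    # Toy 2-D features standing in for normalized F0 or PCA components.
    toy = [
        ([0.0, 0.0], "declarative"), ([0.0, 1.0], "declarative"),
        ([1.0, 0.0], "declarative"), ([10.0, 10.0], "sarcastic"),
        ([10.0, 11.0], "sarcastic"), ([11.0, 10.0], "sarcastic"),
    ]
    print(leave_one_out_rate(toy, k=3))
```

Comparing vectors element by element requires equal lengths, which is consistent with restricting the distances to equal-sized sentences.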

  • Material 1

    Original audio and original video

    Sarcastic, Fascinated, Fond-liking

    file:///C:/work/Dropbox/test/2_3.mp4
    file:///C:/work/Dropbox/vids/8_3.wmv
    file:///C:/work/Dropbox/vids/16_3.wmv

  • Material 2

    Original audio and motion capture

    Animation platform

    Sarcastic, Fascinated, Fond-liking

    file:///C:/work/Dropbox/test2/data3/30_11.mp4
    file:///C:/work/Dropbox/vids/27_7_2.wmv
    file:///C:/work/Dropbox/vids/27_5_2.wmv

  • Material 3

    Resynthesis of prosody

    Head motion, pitch and rhythm from expressive performance

    Add other parameters from neutral performance

    Sarcastic, Fascinated, Fond-liking

    file:///C:/work/Dropbox/test3/data3/30_11.mp4
    file:///C:/work/Dropbox/vids/27_7_9.wmv
    file:///C:/work/Dropbox/vids/27_5_3.wmv

  • Material 3

    Audio resynthesis

    TD-PSOLA (Time-Domain Pitch-Synchronous Overlap and Add)

    Move, delete or duplicate short-time signals

    [Figure: analysed speech and synthesized speech]
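The idea of moving, deleting or duplicating short-time signals can be illustrated with a simplified TD-PSOLA sketch. It assumes the pitch marks are already known (at least two of them, as integer sample indices) and normalizes by the summed window weights, which is a simplification chosen here; the function name and parameters are hypothetical and this is not the implementation used in the study.

```python
import math


def td_psola(signal, marks, duration_factor=1.0, pitch_factor=1.0):
    """Simplified TD-PSOLA on a list of samples.

    marks           : sorted analysis pitch marks (integer sample indices)
    duration_factor : > 1 lengthens the signal, < 1 shortens it
    pitch_factor    : > 1 raises the pitch, < 1 lowers it
    """
    out_len = int(round(len(signal) * duration_factor))
    out = [0.0] * out_len
    weight = [0.0] * out_len
    t = float(marks[0]) * duration_factor  # first synthesis pitch mark
    while t < out_len:
        # Analysis mark closest to the corresponding time in the input.
        # Segments get duplicated when stretching and skipped when shrinking.
        src_time = t / duration_factor
        k = min(range(len(marks)), key=lambda i: abs(marks[i] - src_time))
        if k + 1 < len(marks):
            period = marks[k + 1] - marks[k]
        else:
            period = marks[k] - marks[k - 1]
        centre = int(round(t))
        # Overlap-add a Hann-windowed segment spanning two pitch periods.
        for j in range(-period, period + 1):
            src, dst = marks[k] + j, centre + j
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                w = 0.5 + 0.5 * math.cos(math.pi * j / period)
                out[dst] += w * signal[src]
                weight[dst] += w
        # Spacing of synthesis marks sets the output pitch.
        t += period / pitch_factor
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, weight)]


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * i / 50) for i in range(1000)]
    pitch_marks = list(range(0, 1000, 50))
    slower = td_psola(tone, pitch_marks, duration_factor=1.5)
    print(len(tone), len(slower))
```

With both factors left at 1.0 the overlapping windows sum to one and the input is reproduced, which is a convenient sanity check before changing rhythm or pitch.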

  • Material 3

    Visual resynthesis

    DTW (Dynamic Time Warping) from neutral to expressive

    Cubic spline interpolation for translations

    Quaternion interpolation for rotations

    [Figure: Rotation (quaternion x) and Translation x]
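The two building blocks named above, time alignment and rotation interpolation, can be sketched as follows. This is a hedged illustration only: the DTW works on scalar sequences with an absolute-difference cost, the quaternion interpolation is the standard spherical linear interpolation (slerp) on unit quaternions stored as [w, x, y, z], and the cubic spline step for translations is left out. Function names and the choice of cost are assumptions, not details from the slides.

```python
import math


def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total cost, alignment path) between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + step
    # Backtrack from the end to recover which frames are paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    dot = sum(x * y for x, y in zip(q0, q1))
    if dot < 0.0:  # take the shorter arc
        q1 = [-x for x in q1]
        dot = -dot
    if dot > 0.9995:  # nearly identical: fall back to linear interpolation
        q = [x + t * (y - x) for x, y in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = [s0 * x + s1 * y for x, y in zip(q0, q1)]
    norm = math.sqrt(sum(x * x for x in q))
    return [x / norm for x in q]


if __name__ == "__main__":
    neutral = [0.0, 1.0, 2.0, 3.0]
    expressive = [0.0, 1.0, 1.0, 2.0, 3.0]
    print(dtw_path(neutral, expressive))
    print(slerp([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], 0.5))
```

Interpolating rotations on the unit sphere keeps every intermediate value a valid rotation, which component-wise interpolation of quaternions does not guarantee.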

  • Auto-evaluation

    Material 1 (Original audio and video)

    3 participants

    Best results: female actor

    [Figure: panels for Audio-only, Visual-only and Audio-visual; attitude labels: 1 declarative, 2 exclamative, 3 interrogation, 4 comforting, 5 fond-liking, 6 seductive, 7 fascinated, 8 jealous, 9 thinking, 10 incredulous, 11 sarcastic, 12 scandalised, 13 dazed, 14 responsible, 15 hurt, 16 embarrassed]

    Audio-only: 78.12 %, Visual-only: 81.25 %, Audio-visual: 78.12 %

  • User study: Material 1

    Material 1 (Original audio and video)

    84 participants

    [Figure: panels for Audio-only, Visual-only and Audio-visual; attitude labels: 1 declarative, 2 exclamative, 3 interrogation, 4 comforting, 5 fond-liking, 6 seductive, 7 fascinated, 8 jealous, 9 thinking, 10 incredulous, 11 sarcastic, 12 scandalised, 13 dazed, 14 responsible, 15 hurt, 16 embarrassed]

    Audio-only: 30.98 %, Visual-only: 35.47 %, Audio-visual: 36.90 %

  • User study: Material 2

    Material 2 (Original audio and motion capture)

    42 participants

    [Figure: panels for Audio-only, Visual-only and Audio-visual; attitude labels: 1 declarative, 2 exclamative, 3 interrogation, 4 comforting, 5 fond-liking, 6 seductive, 7 fascinated, 8 jealous, 9 thinking, 10 incredulous, 11 sarcastic, 12 scandalised, 13 dazed, 14 responsible, 15 embarrassed]

    Audio-only: 26.00 %, Visual-only: 17.73 %, Audio-visual: 31.72 %

  • User study: Material 3

    Material 3 (Resynthesis of prosody)

    13 participants

    [Figure: panels for Audio-only, Visual-only and Audio-visual; attitude labels: 1 declarative, 2 responsible, 3 embarrassed, 4 comforting, 5 fond-liking, 6 seductive, 7 fascinated, 8 jealous, 9 thinking, 10 incredulous, 11 sneaky, 12 scandalised, 13 dazed]

    Audio-only: 15.58 %, Visual-only: 11.65 %, Audio-visual: 16.96 %

  • Material 1: Results per attitude

    Declarative, Interrogative, Fond-liking, Comforting, Seductive

    Fascinated, Jealous, Thinking, Disbelieving, Sarcastic, Scandalized

    Dazed, Embarrassed, Responsible

    Exclamative

    Hurt

    0.170 0.066 0.788 0.290 0.368 0.761

    0.391 0.167 0.814 0.214 0.640 0.264

    0.357 0.273 0.150 0.617

  • Material 2: Results per attitude

    Declarative, Interrogative, Fond-liking, Comforting

    Seductive, Fascinated, Jealous, Thinking, Disbelieving

    Sarcastic, Scandalized, Dazed, Embarrassed, Responsible

    Exclamative

    0.235 0.128 0.602 0.253 0.372

    0.550 0.286 0.227 0.424 0.181

    0.313 0.554 0.219 0.040 0.540

  • Material 3: Results per attitude

    Declarative, Fond-liking, Comforting, Seductive, Fascinated

    Jealous, Thinking, Disbelieving, Sarcastic

    Scandalized, Dazed, Embarrassed, Responsible

    0.124 0.105 0.143 0.091 0.231

    0.067 0.379 0.135 0.222

    0.333 0.200 0.000 0.192

  • Average results and conclusions

    Test type          Audio-only   Visual-only   Audio-visual
    Auto-evaluation    62.50%       73.96%        73.96%
    Material 1         30.98%       35.47%        36.90%
    Material 2         26.00%       17.73%        31.72%
    Material 3         15.58%       11.65%        16.96%
    Objective          30.60%       24.14%        40.95%

    Generally better results for Audio-visual

    Rates for Visual-only decrease as animation is used

    Results of objective and subjective tests are comparable for Audio-only

    F0 and speech rhythm present discriminant signatures

    Head movement is not sufficient

  • Future work

    Improve retargeting (Material 2)

    Improve resynthesis (Material 3)

    Blend prosodic and non-prosodic features

    Learn prosodic signatures

    Investigate other prosodic features. Audio: intensity. Video: eyebrow movements, eye gaze, eye blink

    Generate expressive dialogue

  • Thank you!
