Post on 05-Jan-2016
Multi-modal expression of Swedish prominence
Björn Granström
Centre for Speech Technology, Department of Speech, Music and Hearing, KTH, Stockholm,
Sweden
Historical background
• Prosody for speech synthesis at KTH, together with Rolf Carlson
• The Lund intonation model – Gösta Bruce et al.
Several joint projects
Profs – Prosodic phrasing in Swedish, ~1989-1992
Gösta Bruce, Björn Granström and more
First reference: G. Bruce and B. Granström. Modelling Swedish intonation in a text-to-speech system. STL-QPSR, 30(1):17-21, 1989. (on the KTH web)
Potentially ambiguous sentences, varying in phrase boundary location
Entering Count Piper's humble residence
Several joint projects, cont.
Prosodiag - Prosodic Segmentation and Structuring of Dialogue (HSFR + NUTEK), 1993–1996
Gösta Bruce, Björn Granström, Kjell Gustafson, David House, Paul Touati
Project description
The object of study is the prosody of dialogue in a language technology framework. The primary goal of the project is to increase our understanding of how prosodic aspects of speech are exploited interactively in dialogue and, on the basis of this increased knowledge, to create a more powerful prosody model.
Late reference: Gösta Bruce, Johan Frid, Björn Granström, Kjell Gustafson, Merle Horne, and David House. Prosodic segmentation and structuring of dialogue. TMH-QPSR, 37(3):1-6, 1996.
More than 20 joint publications – and then?
Much in the context of the annual phonetics meetings – next:
Project meetings in inspiring surroundings
... probing many different cultures
Is prosody more than sound?
• Our bias: communication is multi-modal
• Traditionally prosodic functions are signaled by “gestures”, perceived by “eye and ear”
• This concerns both body and face gestures
• Preliminary hypothesis: F0 ~ eyebrow height, e.g. Cavé et al. (1996)
• Easy to put to a test with multimodal speech synthesis
Eyebrow vs intonation
“Jag heter Axel, inte Axell” (translation: “My name is Axel, not Axell”). In Sweden Axel is a first name as opposed to Axell, which is a family name.
1 No eyebrow motion
2 Eyebrow motion controlled by the fundamental frequency of the voice
3 Eyebrow motion at focal accents +
4 Eyebrow motion at the first focal accent +
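Condition 2 ties eyebrow motion to the fundamental frequency of the voice. The slides do not say how that control was implemented, so the following is only a minimal sketch of one possible mapping: F0 is converted to semitones above a baseline and scaled linearly to a normalized eyebrow raise. The function name, the baseline of 100 Hz and the 12-semitone range are illustrative assumptions, not the settings used for the stimuli.

```python
import math

def f0_to_eyebrow_raise(f0_hz, baseline_hz=100.0, range_semitones=12.0):
    """Map F0 values (Hz) to a normalized eyebrow raise in [0, 1].

    Unvoiced frames (None or 0) hold the previous eyebrow value.
    baseline_hz and range_semitones are illustrative assumptions.
    """
    raises = []
    previous = 0.0
    for f0 in f0_hz:
        if not f0:  # unvoiced frame: keep the eyebrow where it was
            raises.append(previous)
            continue
        # Distance from the baseline in semitones (12 per octave).
        semitones = 12.0 * math.log2(f0 / baseline_hz)
        # Linear scaling, clamped to the [0, 1] range of the raise parameter.
        previous = min(1.0, max(0.0, semitones / range_semitones))
        raises.append(previous)
    return raises

# Hypothetical F0 contour in Hz; 0 marks an unvoiced frame.
contour = [100.0, 200.0, 0, 50.0]
print(f0_to_eyebrow_raise(contour))
```

A semitone scale is used here because pitch movements are roughly perceived on a logarithmic scale; a mapping directly in Hz would be an equally valid starting point for a test of the hypothesis.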
Goals and research context
• How are visual expressions used to convey and strengthen prosodic functions?
• Understand interactions between visual expressions, dialog functions and speech acoustics
• Context: animated talking agent
– Realistic communicative behavior using multimodal speech synthesis
Visual prosodic functions
• Prominence
– stress
– focus
• Phrasing
• Utterance type
– question
– statement
• Dialogue functions
– back channeling
– turntaking
• Attitudes
• Emotions
Visual prosody cont.
• What is underlying?
• How tight is the AV connection?
• What are the important visual gestures?
• More optional than acoustic prosodic parameters?
• Individual and cultural variation
• Reinforcing or qualifying acoustics?
Formal experiment
Prominence due to eyebrow rise
5 content words: “När pappa fiskar stör piper Putte” (translation: “When dad is fishing sturgeon, Putte is whimpering”)
Example of stimuli
Task: “which word is most prominent” (identical acoustics, varied location of eyebrow movement)
No eyebrow movement (neutral)
Eyebrow movement
Prominence increase due to eyebrow movement
[Figure: Influence on judged prominence by eyebrow movement. Y-axis: % prominence due to eyebrow movement, 0 to 50. Groups: Swedish, Foreign, All.]
Feedback experiment
• Mini dialogues (two turns)
• Travel agent application
• Both visual and acoustic feedback cues
• Affirmative cues – agent understands/accepts the request
• Negative cues – agent is unsure about the request (seeks confirmation)
• Six cues hypothesised
Granström, House & Swerts (2002)
Pos/Neg feedback experiment

Cue              Affirmative setting       Negative setting
Smile            Head smiles               Head has neutral expression
Head movement    Head nods                 Head leans back
Eyebrows         Eyebrows rise             Eyebrows frown
Eye closure      Eyes close a bit          Eyes open widely
F0 contour       Declarative intonation    Interrogative intonation
Delay            Immediate reply           Slow reply
(Granström, House & Swerts 2002)
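The six cues and their two settings can be written down as a small data structure, which makes it easy to enumerate candidate stimuli. This is a sketch only: the cue names and settings are transcribed from the table above, but the helper functions are hypothetical, and the slides do not state whether the experiment used every combination or only a subset.

```python
from itertools import product

# Cue settings transcribed from the table above: (affirmative, negative).
CUES = {
    "Smile": ("Head smiles", "Head has neutral expression"),
    "Head movement": ("Head nods", "Head leans back"),
    "Eyebrows": ("Eyebrows rise", "Eyebrows frown"),
    "Eye closure": ("Eyes close a bit", "Eyes open widely"),
    "F0 contour": ("Declarative intonation", "Interrogative intonation"),
    "Delay": ("Immediate reply", "Slow reply"),
}

def all_cue_combinations(cues=CUES):
    """Yield every assignment of affirmative/negative polarity to the cues."""
    names = list(cues)
    for polarity in product(("affirmative", "negative"), repeat=len(names)):
        yield dict(zip(names, polarity))

def describe(stimulus, cues=CUES):
    """Return the concrete setting of each cue for one stimulus."""
    return {
        name: cues[name][0 if polarity == "affirmative" else 1]
        for name, polarity in stimulus.items()
    }

stimuli = list(all_cue_combinations())
print(len(stimuli))          # full factorial: 2 ** 6 combinations
print(describe(stimuli[0]))  # the all-affirmative stimulus
```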
Cue strength
[Figure: cue strength. Y-axis: average response value, 0 to 3.]
Recording of communicative interactions
Automatic tracking of reflective spots in 3D (Qualisys)
Interactions: emotion and articulation (resynthesis)
(from AV speech database – EU/PF_STAR project)
Measurement points for lip coarticulation analysis
[Figure labels: lateral distance, vertical distance, left mouth corner]
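Given 3D marker positions from the optical tracking, the lateral and vertical lip distances reduce to Euclidean distances between pairs of markers. The sketch below assumes the lateral distance is taken between the two mouth corners and the vertical distance between an upper-lip and a lower-lip marker. The marker names and the example coordinates are hypothetical, since the slides only show the figure labels.

```python
import math

def lip_measures(frame):
    """Compute lateral and vertical lip distances for one tracked frame.

    frame maps hypothetical marker names to (x, y, z) positions in mm.
    """
    return {
        # Mouth width: distance between the two mouth-corner markers.
        "lateral": math.dist(frame["left_corner"], frame["right_corner"]),
        # Mouth opening: distance between upper-lip and lower-lip markers.
        "vertical": math.dist(frame["upper_lip"], frame["lower_lip"]),
    }

# Made-up coordinates for a single frame, for illustration only.
frame = {
    "left_corner": (-25.0, 0.0, 0.0),
    "right_corner": (25.0, 0.0, 0.0),
    "upper_lip": (0.0, 6.0, 0.0),
    "lower_lip": (0.0, -6.0, 0.0),
}
print(lip_measures(frame))
```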
The expressive mouth
• All vowels (sentences)
– Encouraging
– Happy
– Angry
– Sad
– Neutral
”left mouth corner”
(Svanfeldt et al. 2003)
Prompted read speech database
• Expressive modes:
– Confirming, questioning, certain, uncertain, happy, (angry)
• 39 short, content neutral sentences with three possible focal accent positions each, e.g.
– Båten seglade förbi (The boat sailed by)
– Dom flyttade möblerna (They moved the furniture)
• Nonsense words (VCV, VCCV, CVC)
• Digits
Mean eyebrow positions for one speaker
Nose marker traces with automatic (blue) and two human (red) annotated head nods (adapted from Cerrato & Svanfeldt 2006)
Examples from the database
[Figure: Confirming and Happy readings. Focal accent on: Båten seglade förbi]
Exploitation of visual parameters
• Visual cues exploited at focal accent
• Mouth cues
– Happy, encouraging
• Eyebrow cues
– Happy, questioning
• Vertical head nods
– Confirming
Analysis in terms of FAP and FMQ
MPEG-4 Facial Animation Parameter (FAP): a subset of 31 FAPs out of the 68 FAPs defined in the MPEG-4 standard, including only the ones that we were able to calculate directly from our measured point data
Focal Motion Quotient, FMQ, defined as the standard deviation of a FAP parameter taken over a word in focal position, divided by the average standard deviation of the same FAP in the same word in non-focal position.
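The FMQ definition translates directly into a few lines of code. The following is a minimal sketch for one FAP and one word, assuming each reading is available as a list of FAP samples over the word. The function name and the choice of the population standard deviation are assumptions, as the slides do not specify either.

```python
from statistics import mean, pstdev

def focal_motion_quotient(focal_track, nonfocal_tracks):
    """FMQ for one FAP and one word.

    focal_track: FAP samples over the word when it carries the focal accent.
    nonfocal_tracks: one list of FAP samples per reading in which the
        same word is in non-focal position.
    """
    focal_sd = pstdev(focal_track)
    # Average the standard deviations of the non-focal readings.
    nonfocal_sd = mean(pstdev(track) for track in nonfocal_tracks)
    if nonfocal_sd == 0:
        raise ValueError("no variation in non-focal position; FMQ is undefined")
    return focal_sd / nonfocal_sd

# Made-up FAP samples: the focal reading moves twice as much.
focal = [0, 4, 0, 4]
nonfocal = [[0, 2, 0, 2], [1, 3, 1, 3]]
fmq = focal_motion_quotient(focal, nonfocal)
print(fmq)
```

An FMQ above 1 means the parameter varies more when the word is in focus than when it is not; a value near 1 means focus makes no difference to that parameter.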
[Figure: The focal motion quotient, FMQ, averaged across all sentences, for all measured MPEG-4 FAPs for several expressive modes. Y-axis: FMQ, 0 to 4.5. Series: Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral. FAP groups along the x-axis: articulation | smile | brows | head.]
FAPs shown, in x-axis order:
3: open jaw
14: thrust jaw
15: shift jaw
18: depress chin
39: puff left cheek
40: puff right cheek
41: lift left cheek
42: lift right cheek
16: push bottom lip
52: raise bottom midlip
57: raise bottom lip lm
58: raise bottom lip rm
17: push top lip
51: lower top midlip
55: lower top lip left mid
56: lower top lip rm
53: stretch left cornerlip
54: stretch right cornerlip
59: raise left cornerlip
60: raise right cornerlip
31: raise left inner eyebrow
32: raise right inner eyebrow
33: raise left mid eyebrow
34: raise right mid eyebrow
35: raise left outer eyebrow
36: raise right outer eyebrow
37: squeeze left eyebrow
38: squeeze right eyebrow
48: head pitch
49: head yaw
50: head roll
The effect of focus on the variation of several groups of MPEG-4 FAP parameters, for different expressive modes
[Figure: Y-axis: FMQ (Focal Motion Quotient), 0 to 3. X-axis: Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral. Series: articulation, smile, brows, head.]
The effect of focal accent on selected parameter variations in Certain and Uncertain readings
[Figure: Y-axis: FMQ (Focal Motion Quotient), 0 to 4. X-axis: Certain, Uncertain. Series: 31: raise left inner eyebrow, 32: raise right inner eyebrow, 33: raise left mid eyebrow, 34: raise right mid eyebrow, 48: head pitch, 49: head yaw.]
What's next?
• Better recordings
• Detailed analysis of the eye region: “Gaze and wrinkles”
• Use in applications, e.g. spoken dialogue systems
• And more audible prosody...
New cooperative project
SIMULEKT - Simulering av svenskans prosodiska dialekttyper (Simulating intonational varieties of Swedish)
VR 2007-2009
And finally………..
Congratulations!
Well done Gösta!