![Page 1: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/1.jpg)
Trainable Videorealistic Speech Animation
Tony Ezzat, Gadi Geiger, Tomaso Poggio
CBCL/AI Lab, MIT
![Page 2: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/2.jpg)
Outline
• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
![Page 3: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/3.jpg)
Overview
Video Database
“Air” “Badge”
Visual Speech Processing
“Badge”
2 Themes: Videorealism, Machine Learning
Mary101
![Page 4: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/4.jpg)
Audio Analysis
Video Database
“Air” “Badge”
Visual Speech Processing
“Badge”
Audio Database
Audio is also recorded, to help label the video
![Page 5: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/5.jpg)
Audio Synthesis
Video Database
“Air” “Badge”
Visual Speech Processing
“Badge”
Audio Database
“Badge”
Audio Speech Processing (crossed out)
No Audio Synthesis!
![Page 6: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/6.jpg)
What is the Input REALLY?
Visual Speech Processing
“Badge”
![Page 7: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/7.jpg)
Input: Phone Stream
Visual Speech Processing
/SIL B B B AE AE JH JH SIL SIL/
Real Audio
Speech Recognition
Forced Viterbi Alignment
Manual Labelling
“Badge”
TTS
“Badge”
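The phone stream above carries one label per video frame. A minimal sketch of how a time-aligned phone segmentation could be expanded into such a stream; the function name `phones_to_frames`, the segment format, and the frame rate are illustrative assumptions, not part of the original system:

```python
def phones_to_frames(segments, fps):
    """Expand (phone, start_sec, end_sec) segments into one label per video frame."""
    stream = []
    for phone, start, end in segments:
        # Number of frames whose index falls inside this segment.
        n_frames = int(round(end * fps)) - int(round(start * fps))
        stream.extend([phone] * n_frames)
    return stream

# Hypothetical alignment for the word "badge" at an assumed 10 frames per second.
segments = [
    ("SIL", 0.0, 0.1),
    ("B", 0.1, 0.4),
    ("AE", 0.4, 0.6),
    ("JH", 0.6, 0.8),
    ("SIL", 0.8, 1.0),
]
stream = phones_to_frames(segments, fps=10.0)
print("/" + " ".join(stream) + "/")  # /SIL B B B AE AE JH JH SIL SIL/
```

The segment times would come from whichever source is available: forced alignment of real audio, manual labelling, or the timing output of a TTS engine.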
![Page 8: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/8.jpg)
Pre- and Post-Processing
Pre-Processing: Remove head movement using planar perspective warping
Post-Processing: Mask out mouth; track & recomposite into background sequence
Visual Speech Processing
/SIL B B B AE AE JH JH SIL SIL/
![Page 9: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/9.jpg)
Tracking & Compositing
![Page 10: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/10.jpg)
Outline
• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results
![Page 11: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/11.jpg)
Video Rewrite (Bregler, Covell, Slaney 1997)
Hello: /H-E-L/ + /E-L-OW/
• Triphone basis units
• Reorder them to new utterance
• Pixel blending at join points
Coarticulation: /utu/ vs /iti/
![Page 12: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/12.jpg)
Video Rewrite Issues (Bregler, Covell, Slaney 1997)
• Sampling coarticulation
  - 20000 triphones ~ 3 hrs!
• Model of speech is entire video corpus
  - No capacity to learn/model/distill
  - Not a parsimonious representation
• Poor capacity for novel image synthesis
  - Poor smoothing at join points
  - Cannot stretch/shrink to match audio
  - Discrete number of paths
  - Cannot fill in missing data
![Page 13: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/13.jpg)
Outline
• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results
![Page 14: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/14.jpg)
Extracting Prototypes
46 prototypes extracted using PCA and K-means clustering
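A minimal sketch of how PCA followed by K-means could select prototype frames, assuming NumPy and flattened grayscale mouth images. The function names, the number of principal components, and the choice of the real frame nearest each cluster centre are illustrative assumptions, not details confirmed by the slides:

```python
import numpy as np

def pca_project(frames, n_components):
    """Project flattened frames (n_frames x n_pixels) onto the top principal components."""
    centered = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iters=50, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centres."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centre where it was
                centers[j] = members.mean(axis=0)
    return centers

def pick_prototypes(frames, k, n_components=15):
    """Return indices of the real frames closest to each K-means centre in PCA space."""
    coords = pca_project(frames, n_components)
    centers = kmeans(coords, k)
    dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
    return np.unique(dists.argmin(axis=0))
```

Calling `pick_prototypes(frames, k=46)` on the corpus frames would return up to 46 frame indices. Picking real frames rather than cluster means keeps the prototypes sharp, since averaged images blur.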
![Page 15: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/15.jpg)
Multidimensional Morphable Model
[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$, parameterized by $(\alpha, \beta)$]
![Page 16: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/16.jpg)
MMM Background
Tommy Poggio/MIT
David Beymer
Mike Jones
Vinay Kumar
Volker Blanz/MPI Saarbrücken
Thomas Vetter/University of Basel
Tim Cootes/Manchester
Michael Black/Brown
![Page 17: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/17.jpg)
1D Morphing
(Beier & Neely 1992)
$$\beta_1\,\mathrm{WARP}(I_1, \alpha_1 F_1) \;+\; \beta_2\,\mathrm{WARP}(I_2, \alpha_2 F_2)$$
[Plots: the morph parameters vary between 0 and 1]
![Page 18: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/18.jpg)
Optical Flow
$C = \{d_x(x,y),\ d_y(x,y)\}$
(Beymer, Shashua, Poggio 93) (Chen & Williams 93)
![Page 19: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/19.jpg)
1D Morphing w/Optical Flow
Forward warping A to B
Forward warping B to A
Blending
Hole filling
![Page 20: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/20.jpg)
MMM Definition
46 image prototypes from corpus
46 optical flows between prototypes
[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$]
Parameterize using $(\alpha, \beta)$
$\alpha$ is 46-dimensional
$\beta$ is 46-dimensional
![Page 21: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/21.jpg)
MMM Synthesis
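[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$]
$$C_1^{synth} = \sum_{i=1}^{N} \alpha_i C_i$$
$$C_i^{synth} = W\!\left(C_1^{synth} - C_i,\; C_i\right)$$
$$I_i^{warp} = W\!\left(I_i,\; C_i^{synth}\right)$$
$$I^{morph}(\alpha, \beta) = \sum_{i=1}^{N} \beta_i\, I_i^{warp}$$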
Fine, but what about speech?
![Page 22: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/22.jpg)
Mary101 Speech Model
[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$, with phoneme clusters /SIL/, /F/, /AE/]
Each phoneme represents a cluster in MMM space
The speech trajectory passes close to the clusters but is also smooth
![Page 23: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/23.jpg)
MMM Analysis
[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$; analysis recovers $(\alpha, \beta)$]
![Page 24: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/24.jpg)
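MMM Analysis (Cont'd)
[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$; novel image $I_{novel}$ with flow $C_{novel}$]
Flow parameters: minimize
$$\left\| C_{novel} - \sum_{i=1}^{N} \alpha_i C_i \right\|$$
Re-orient + Warp
Texture parameters: minimize
$$\left\| I_{novel} - \sum_{i=1}^{N} \beta_i\, I_i^{warped} \right\| \quad \text{subject to} \quad \beta_i > 0 \;\; \forall i, \qquad \sum_{i=1}^{N} \beta_i = 1$$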
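A minimal sketch of the two analysis fits, assuming NumPy and that the novel flow and the warped prototypes have already been computed. The flow weights use ordinary least squares; the texture weights use projected gradient descent onto the probability simplex, which is a solver chosen here for brevity rather than the one used in the original work (it enforces $\beta_i \ge 0$ rather than strict positivity). All function names are hypothetical:

```python
import numpy as np

def fit_alpha(novel_flow, proto_flows):
    """Least-squares alpha so that sum_i alpha_i C_i approximates C_novel."""
    A = proto_flows.reshape(len(proto_flows), -1).T  # (H*W*2) x N
    alpha, *_ = np.linalg.lstsq(A, novel_flow.ravel(), rcond=None)
    return alpha

def project_to_simplex(v):
    """Euclidean projection onto {b : b_i >= 0, sum_i b_i = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = ks[u - css / ks > 0][-1]
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def fit_beta(novel_image, warped_protos, n_iters=500):
    """Constrained least squares for the texture weights beta."""
    B = warped_protos.reshape(len(warped_protos), -1).T  # (H*W) x N
    target = novel_image.ravel()
    step = 1.0 / np.linalg.norm(B, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.full(B.shape[1], 1.0 / B.shape[1])
    for _ in range(n_iters):
        grad = B.T @ (B @ beta - target)
        beta = project_to_simplex(beta - step * grad)
    return beta
```

Running both fits on every frame of the corpus yields one $(\alpha, \beta)$ pair per frame.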
![Page 25: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/25.jpg)
MMM Analysis Parameters
[Plots: flow and texture parameters for the words “badge” and “lavish”]
![Page 26: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/26.jpg)
Comparison of Real and Synthesized Images
Tongue is not perfect
Slight blurring
[Image pairs: Real vs. Synthetic]
![Page 27: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/27.jpg)
Analysis of Entire Recorded Corpus
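$z_1 = (\alpha_1, \beta_1),\quad z_2 = (\alpha_2, \beta_2),\quad \dots,\quad z_{30000} = (\alpha_{30000}, \beta_{30000})$
[Diagram: each frame of the video corpus is analyzed against the prototype images $I_1, \dots, I_4$ and flows $C_2, C_3, C_4$]
Video Corpus: /b/ /jh/ /ae/ …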
![Page 28: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/28.jpg)
Phonetic Clusters
Represent each phone with $\mu_p$, $\Sigma_p$
One set for flows, another set for textures
[Plot: clusters for /t/, /w/, /m/, /aa/, /b/]
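A minimal sketch of the sample estimates of $\mu_p$ and $\Sigma_p$, assuming NumPy, a matrix of per-frame MMM parameters, and one phone label per frame. The function name `phone_statistics` is hypothetical:

```python
import numpy as np

def phone_statistics(params, labels):
    """Return {phone: (mu_p, Sigma_p)} from per-frame parameters.

    params: (n_frames, dim) array of flow (alpha) or texture (beta) parameters.
    labels: sequence of n_frames phone labels.
    """
    labels = np.asarray(labels)
    stats = {}
    for phone in sorted(set(labels.tolist())):
        rows = params[labels == phone]
        mu = rows.mean(axis=0)
        if len(rows) > 1:
            sigma = np.atleast_2d(np.cov(rows, rowvar=False))
        else:
            # A single sample has no spread; use zeros as a placeholder.
            sigma = np.zeros((params.shape[1], params.shape[1]))
        stats[phone] = (mu, sigma)
    return stats
```

The function would be called twice, once on the $\alpha$ parameters and once on the $\beta$ parameters, to obtain the separate flow and texture sets.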
![Page 29: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/29.jpg)
Trajectory Synthesis
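$$\min_y \;\; \underbrace{(y - \mu)^T\, \Sigma^{-1}\, (y - \mu)}_{\text{Phonetic Targets}} \;+\; \lambda\, \underbrace{\|\Delta y\|^2}_{\text{Smoothness}}$$
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_T \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{P_1} & & & \\ & \Sigma_{P_2} & & \\ & & \ddots & \\ & & & \Sigma_{P_T} \end{bmatrix}$$
/SIL B B B AE AE JH JH SIL SIL/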
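Because the objective is quadratic in $y$, setting its gradient to zero gives the linear system $(\Sigma^{-1} + \lambda \Delta^T \Delta)\, y = \Sigma^{-1} \mu$. A minimal sketch of that solve, assuming NumPy, a first-order $\Delta$, and a `stats` dictionary mapping each phone to its $(\mu_p, \Sigma_p)$; the function name and data layout are illustrative assumptions:

```python
import numpy as np

def synthesize_trajectory(phones, stats, lam):
    """Minimize (y-mu)^T Sigma^-1 (y-mu) + lam * ||Delta y||^2 in closed form.

    phones: per-frame phone labels; stats: phone -> (mu_p, Sigma_p).
    Returns an array of shape (T, dim), one parameter vector per frame.
    """
    T = len(phones)
    dim = len(stats[phones[0]][0])
    # Stack the per-frame targets and build the block-diagonal Sigma^-1.
    mu = np.concatenate([np.asarray(stats[p][0], dtype=float) for p in phones])
    sigma_inv = np.zeros((T * dim, T * dim))
    for t, p in enumerate(phones):
        block = slice(t * dim, (t + 1) * dim)
        sigma_inv[block, block] = np.linalg.inv(stats[p][1])
    # First-order Delta: rows of [-I, I] blocks, so (Delta y)_t = y_{t+1} - y_t.
    delta = np.kron(np.diff(np.eye(T), axis=0), np.eye(dim))
    # Zero gradient: (Sigma^-1 + lam * Delta^T Delta) y = Sigma^-1 mu
    y = np.linalg.solve(sigma_inv + lam * delta.T @ delta, sigma_inv @ mu)
    return y.reshape(T, dim)
```

With `lam = 0` the trajectory jumps exactly between the phone means; as `lam` grows it flattens toward a constant. Intermediate values trade target accuracy against smoothness.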
![Page 30: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/30.jpg)
Smoothness
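$$\Delta = \begin{bmatrix} -I & I & & \\ & -I & I & \\ & & \ddots & \ddots \\ & & & -I & I \end{bmatrix}$$
Higher orders of smoothness: $\Delta\Delta$, $\Delta\Delta\Delta$, … (order 2, 3, …)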
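A minimal sketch of building $\Delta$ and its higher orders, assuming NumPy. The helper name `smoothness_operator` is hypothetical, and the higher orders are implemented as repeated finite differences (each application shortens the sequence by one frame), which is one reading of the $\Delta\Delta$ notation:

```python
import numpy as np

def smoothness_operator(n_frames, dim, order=1):
    """Finite-difference operator of the given order acting on stacked frames.

    Order 1 has rows of [-I, I] blocks; order k is the k-th repeated difference.
    The result has shape ((n_frames - order) * dim, n_frames * dim).
    """
    scalar_diff = np.diff(np.eye(n_frames), n=order, axis=0)
    return np.kron(scalar_diff, np.eye(dim))

# First-order operator for 4 scalar frames: each row computes y[t+1] - y[t].
print(smoothness_operator(4, 1, order=1))
```

An order-2 operator annihilates any linear ramp, so it penalizes curvature rather than slope.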
![Page 31: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/31.jpg)
Setting $\Delta$, $\lambda$
Cross-validation:
flow: $\Delta$ order 4, $\lambda$ = 250: septic splines
texture: $\Delta$ order 5, $\lambda$ = 100: nonic (ninth-degree) splines
![Page 32: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/32.jpg)
Setting Phonetic Clusters
Use sample estimates?
/t/
/b/
Problem: Underarticulation!
![Page 33: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/33.jpg)
Adjusting Phonetic Clusters
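Use gradient descent to tweak the phonetic clusters
Compare synthesized trajectory with original trajectory:
original $z = \{\alpha_t, \beta_t\}$, synthesized $y = \{\alpha_t, \beta_t\}$
$$E = (z - y)^T (z - y)$$
$$\frac{\partial E}{\partial \mu_i} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial \mu_i}$$
$$\mu^{new} = \mu^{old} - \eta\,\frac{\partial E}{\partial \mu}$$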
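A minimal sketch of the gradient-descent update for the phone means, assuming NumPy, a single scalar parameter per frame, fixed per-phone variances, and a first-order $\Delta$. Under those assumptions the synthesized trajectory is linear in the means, $y = M\mu$ with $M = (\Sigma^{-1} + \lambda \Delta^T \Delta)^{-1} \Sigma^{-1}$, so $\partial E / \partial \mu = -2 M^T (z - y)$. All function names, the learning rate, and the step count are illustrative:

```python
import numpy as np

def build_synthesis_matrix(phones, var_p, lam):
    """Return M with y = M @ mu, from (Sigma^-1 + lam * Delta^T Delta) y = Sigma^-1 mu."""
    T = len(phones)
    delta = np.diff(np.eye(T), axis=0)
    sigma_inv = np.diag([1.0 / var_p[p] for p in phones])
    return np.linalg.solve(sigma_inv + lam * delta.T @ delta, sigma_inv)

def trajectory(M, phones, mu_p):
    """Synthesized trajectory y for the given per-phone means."""
    return M @ np.array([mu_p[p] for p in phones])

def train_phone_means(z, phones, mu_p, var_p, lam, eta=0.05, n_steps=200):
    """Gradient descent on E = (z - y)^T (z - y) with respect to the phone means."""
    M = build_synthesis_matrix(phones, var_p, lam)
    mu_p = dict(mu_p)  # do not modify the caller's dictionary
    for _ in range(n_steps):
        y = trajectory(M, phones, mu_p)
        grad_frames = -2.0 * M.T @ (z - y)  # dE/dmu_t for every frame t
        for p in mu_p:
            # Frames sharing a phone share its mean, so their gradients add up.
            frames = np.array([q == p for q in phones])
            mu_p[p] -= eta * grad_frames[frames].sum()  # mu_new = mu_old - eta * dE/dmu_p
    return mu_p
```

Training tends to push the means outward from their sample estimates, because the smoothness term otherwise keeps the trajectory from reaching its targets.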
![Page 34: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/34.jpg)
Phones Before/After Training
[Plots: /t/ and /b/ clusters, before and after training]
![Page 35: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/35.jpg)
Trajectories Before/After Training
[Plots: trajectories of $\alpha_{12}$ and $\beta_{28}$]
![Page 36: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/36.jpg)
Coarticulation Model
[Diagram: prototype images $I_1, \dots, I_4$ connected by optical flows $C_2, C_3, C_4$, with clusters /B/, /U/, /T/, /I/]
Coarticulation controlled by width of cluster regions
![Page 37: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/37.jpg)
Coarticulation
/utu/
/iti/ /ata/
/ubu/
/ibi/ /aba/
![Page 38: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/38.jpg)
Big Picture
• Pre-process
• Construct MMM: $\{I_i, C_i\}$
• Analyze Corpus: $\{\alpha_t, \beta_t\}$
• Train phonetic models: $\{\mu_p, \Sigma_p\}$ (/SIL/, /F/, /AE/)
• Trajectory Synthesis: /SIL B B B AE AE JH JH SIL SIL/ to $\{\alpha_t, \beta_t\}$
• MMM Synthesize!
• Post-process
![Page 39: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/39.jpg)
Results
Mary101:
8 minutes of training data
1-syllable words: 132 training/20 test
2-syllable words: 136 training/20 test
46-prototype MMM
Sentences not even included in training.
![Page 40: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/40.jpg)
Comments So Far
“She looks like she’s been Botox’ed” -- Nobel Laureate
“Has she had a frontal lobotomy?” -- ATT executive
Send me your comments to
![Page 41: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/41.jpg)
Visual Turing Tests
We win!
| Experiment | % correct | P< |
|---|---|---|
| Single presentation | 52.1% | 0.3 |
| Double presentation | 46.6% | 0.5 |
![Page 42: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/42.jpg)
Visual Intelligibility
Still some work to do…
Correct Phoneme ID

| Experiment | % correct on N | % correct on S | P< |
|---|---|---|---|
| Words+Sents | 30.01% | 21.19% | 0.001 |
| Words | 38.55% | 28.07% | 0.001 |
| Sents | 24.38% | 16.52% | 0.01 |
![Page 43: Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · · 2008-05-19Audio Synthesis Video. ... Real Audio. Speech . Recognition. Forced. Viterbi](https://reader031.vdocuments.us/reader031/viewer/2022022409/5b08d5d87f8b9ac90f8d275b/html5/thumbnails/43.jpg)
Stay Tuned!
Acknowledgments:
Association Christian Benoit
NSF
NTT
ITRI
Mary101
Dynasty Models
Craig Milanesi
Dave Konstine
Joanne Flood
Jay Benoit
Marypat Fitzgerald
Casey Johnson
Vinay Kumar
Sayan Mukherjee
Chao Wang
Adlar Kim
Danielle Suh
Osamu Yoshimi
Volker Blanz
Thomas Vetter
Demetri Terzopoulos
Jenny Shapiro/BMG
Rehema Ellis/NBC
Kevin Chang