


Cross-Language Speech Dependent Lip-Synchronization

Abhishek Jha 1, Vikram Voleti 2, Vinay Namboodiri 3, C. V. Jawahar 1

1 CVIT, KCIS, IIIT Hyderabad, India   2 Mila, Université de Montréal, Canada   3 Department of Computer Science, IIT Kanpur, India

Motivation

Goal: given a video in a foreign language and its dubbed speech in a regional language, synchronize the lips in the video to match the dubbed audio.

❖ Exponential growth in internet users in the last few years.
❖ A majority of these users are college/school-goers and young adults.
❖ A diverse demographic of students from all over the world enrolls in MOOCs.

Challenges: audio-to-lip-fiducial generation, cross-language lip synchronization, and cross-identity lip synchronization.

Email: [email protected], [email protected]

Project page: preon.iiit.ac.in/~abhishek_jha/lipsync_inst_vid

Datasets

● Hindi Speech corpus: 2.5 hours of audio-visual speech.

● GRID Corpus: 33 speakers, 1,000 phrases each (from a 51-word vocabulary).

● Andrew Ng ML videos: 16,000 image frames extracted from deeplearning.ai MOOC videos.

● Telugu movies: scenes with the protagonist’s face, extracted from Telugu movies.

Contributions

❖ Developed a cross-language lip-synchronization model for dubbed speech videos (e.g. Hindi dubbing for English videos).

❖ Developed an automated pipeline to curate a dataset to train the proposed model.

❖ Conducted a user-based study, which shows that learners prefer lip-synced dubbed videos.

Approach

❖ Audio -> lip landmarks: train a time-delayed (TD) LSTM on the GRID corpus and the Hindi speech corpus.

❖ Intermediate processing: search for the best face frame to combine with the generated lip landmarks.

❖ Face + lip landmarks -> new face: train a U-Net on videos from the movie dataset.

❖ Inference:
➢ Generate lip landmarks from the dubbed audio using the TD-LSTM.
➢ Generate new frames by modifying the original Andrew Ng video frames using the U-Net.
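The audio-to-landmark step above can be sketched as a minimal NumPy forward pass. This assumes a single LSTM layer, a linear projection from the hidden state to the 40-D landmark vector, and an illustrative delay of 5 frames; the layer sizes and delay are assumptions, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates (i, f, g, o) are stacked along the first axis."""
    z = W @ x + U @ h + b
    H = h.size
    i = sigmoid(z[:H])           # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o = sigmoid(z[3 * H:])       # output gate
    c = f * c + i * g
    return o * np.tanh(c), c

def td_lstm_landmarks(mfcc, params, delay=5):
    """Map a (T, 12) MFCC sequence to (T, 40) lip landmarks.

    "Time-delayed" here means the landmark output for frame t is read off
    the hidden state at step t + delay, so the network can look ahead a
    few audio frames before committing to a mouth shape.
    """
    T, D = mfcc.shape
    H = params["U"].shape[1]
    # Zero-pad the input so the last frames still get `delay` steps of lookahead.
    x_seq = np.vstack([mfcc, np.zeros((delay, D))])
    h, c = np.zeros(H), np.zeros(H)
    hidden = []
    for t in range(T + delay):
        h, c = lstm_step(x_seq[t], h, c, params["W"], params["U"], params["b"])
        hidden.append(h)
    out = np.stack(hidden[delay:])              # (T, H), shifted by the delay
    return out @ params["Wo"].T + params["bo"]  # (T, 40) landmark vectors
```

With trained weights, feeding the 100 x 12-D MFCC features of a dubbed clip yields one 40-D lip-landmark vector per video frame.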

● C: comfort level
● US: un-synced
● S: lip-synced
● LS%: lip-sync percentage

[Figure: qualitative lip-generation results on the Telugu Movies, Andrew Ng, GRID Corpus, and Hindi Speech datasets]

Dataset curation - automated pipeline

U-Net results

[Figure: U-Net results — input video clips, masks, landmark augmentation, lip generation, and recomputed ROI frames]

[Figure: approach — a time-delayed LSTM maps 100 x 12-D MFCC features of the input clip (frames t1 … tn, output O1 after the time delay) to 40-D lip landmarks; intermediate processing clusters all lip landmarks for the dubbed audio (6 clusters in ℝ40) and selects frames by cluster assignment; generation is then done with a U-Net]
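The intermediate-processing step (cluster the generated landmarks, then pick a real frame per landmark) can be sketched in NumPy. The 6-cluster k-means below matches the poster's "Clustering (6 clusters)"; the nearest-landmark selection rule is an assumption, since the poster does not spell out the exact frame-selection criterion:

```python
import numpy as np

def kmeans(X, k=6, iters=50, seed=0):
    """Plain k-means: returns (centroids, cluster assignment per row of X)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # (N, k) distances
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(0)
    return C, assign

def select_frames(generated, candidates, k=6):
    """For each generated 40-D landmark vector, pick the index of the
    candidate frame whose landmarks are closest, restricting the search
    to the nearest of k landmark clusters."""
    C, cand_assign = kmeans(candidates, k=k)
    picks = []
    for g in generated:
        j = ((C - g) ** 2).sum(1).argmin()       # nearest cluster centroid
        idx = np.where(cand_assign == j)[0]
        if idx.size == 0:                        # empty cluster: search all frames
            idx = np.arange(len(candidates))
        best = idx[((candidates[idx] - g) ** 2).sum(1).argmin()]
        picks.append(best)
    return np.array(picks)
```

The clustering keeps the per-frame search small: each generated landmark vector is matched only against frames in its own cluster rather than the whole video.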

[Figure: dataset curation — automated pipeline. English videos pipeline: frame extraction, face detection, face recognition against the target person's image, landmark extraction, lip-polygon ROI masking, and merging produce the U-Net input / ground-truth pairs. Hindi videos pipeline: voice activity detection and video splitting, then MFCC extraction on the audio and face detection, landmark extraction, and normalization on the video produce the TD-LSTM inputs (40-D lip landmarks).]
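The lip-polygon ROI-masking step in the pipeline can be sketched with a vectorized even-odd (ray-casting) point-in-polygon test in NumPy — a minimal stand-in for whatever polygon rasterizer the actual pipeline uses:

```python
import numpy as np

def lip_roi_mask(height, width, lip_pts):
    """Boolean mask that is True inside the lip polygon.

    lip_pts: (x, y) vertices in order around the polygon. A pixel is
    inside if a horizontal ray from it crosses the boundary an odd
    number of times (even-odd rule).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.zeros((height, width), dtype=bool)
    n = len(lip_pts)
    for i in range(n):
        x0, y0 = lip_pts[i]
        x1, y1 = lip_pts[(i + 1) % n]
        crosses = ((y0 <= ys) != (y1 <= ys)) & (
            xs < (x1 - x0) * (ys - y0) / (y1 - y0 + 1e-12) + x0
        )
        inside ^= crosses  # toggle on each boundary crossing
    return inside

def mask_lip_region(frame, lip_pts):
    """Zero out the lip ROI so the generator must redraw it from landmarks."""
    mask = lip_roi_mask(frame.shape[0], frame.shape[1], lip_pts)
    if frame.ndim == 3:
        mask = mask[..., None]
    return np.where(mask, 0, frame)
```

The masked frame, paired with the target lip landmarks, forms the U-Net input, while the unmasked frame serves as ground truth.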

Cross-language results

[Figure: cross-language results — identity image, speech video, and dubbing video]

User-based study

            C - US   C - S   LS% - US   LS% - S
Mean         2.51     3.1     23.86      45.95
Std. dev.    1.07     0.6     25.9       24.1

1. Audio -> lip landmarks
2. Video frame + new lip landmarks -> new video frame