Audio-visual processing of speech with DNN

Ido Ariav

Electrical Engineering Department, Technion - Israel Institute of Technology

Supervised by Prof. Israel Cohen

Outline

▪ Background - Voice Activity Detection

▪ Deep Multimodal Architectures for Voice Activity Detection

▪ Results

Voice Activity Detection (VAD): Some background

Voice Activity Detection (VAD)

▪ Many applications - speech and speaker recognition, speech enhancement, dominant speaker identification, hearing-improvement devices, etc.

▪ A preliminary block for other speech-related applications

Traditional Methods

▪ Simple acoustic features (e.g. zero-crossings), model-based methods (e.g. GMM)

▪ Performance deteriorates in the presence of noise

▪ Cannot model highly non-stationary noise (transients)

[Figure: -3 dB threshold vs. -4 dB threshold]
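As an illustration of the simple acoustic features listed under Traditional Methods, here is a minimal sketch of a zero-crossing rate computed per frame. The function names, the 8 kHz sample rate, and the test tones are assumptions made for the example; they do not come from the slides.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def tone(freq_hz, sample_rate=8000, n=400, phase=0.1):
    """Hypothetical test signal: a sine tone (the phase avoids exact zeros)."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate + phase)
        for t in range(n)
    ]


low_rate = zero_crossing_rate(tone(100))    # slowly varying frame
high_rate = zero_crossing_rate(tone(2000))  # rapidly varying frame
print(low_rate, high_rate)
```

A detector built on such a feature compares it with a fixed threshold, which is one reason it struggles once noise and transients change the statistics of the frame.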

Deep NN

▪ Deep learning to the rescue!

▪ But wait… speech is a time series, so why should we treat it as a frame-by-frame classification problem?

speech

Multimodal

▪ Are there any other sensors we could use?

▪ Video is especially useful in challenging acoustic environments

Deep Multimodal Architectures for Voice Activity Detection

Problem Setting

▪ A multimodal setting: audio and video signals are both available.

▪ Stationary background noise and transients (metronome, keyboard typing, hammering) are added to the clean signal

▪ 11 speakers, each recording 120 seconds long
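One common way to add noise to a clean signal at a prescribed SNR is to scale the noise by a gain derived from the two signal powers. The sketch below assumes that approach and a single noise source; the slides do not state how the mixing was done, and all names and signals here are hypothetical.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it."""
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * v for v in noise]
    noisy = [c + s for c, s in zip(clean, scaled)]
    return noisy, scaled


rng = random.Random(0)
# Stand-ins for a clean recording and a stationary noise recording.
clean = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

noisy, scaled = mix_at_snr(clean, noise, snr_db=5.0)
measured_snr = 10 * math.log10(power(clean) / power(scaled))
print(round(measured_snr, 6))
```

A transient such as hammering could be added in the same way as a second noise source, with its own gain.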

Speech

transients

Deep architecture for VAD

Feature Extraction

▪ Audio Features - MFCC (Mel-frequency cepstral coefficients)

▪ Video Features - motion vectors (MV)

▪ MV capture both spatial and temporal information

Transient Reducing AE

▪ A special autoencoder (AE) is designed both to fuse the audio and video signals and to reduce the effect of noise and transients

Clean

mushroom

Recurrent Neural Network

▪ The transient reducing AE is followed by a multilayered RNN

▪ The length of the temporal window is learned instead of being arbitrarily predetermined.

▪ A sigmoid on the RNN output produces a probability measure for the presence of speech in each frame 𝑛
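The per-frame output described above can be sketched with a tiny recurrent network. This is a simplified illustration, not the architecture from the slides: it uses a single tanh recurrent layer instead of a multilayered RNN, random untrained weights, and hypothetical names and sizes throughout.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_speech_probabilities(frames, w_in, w_rec, w_out, b_out):
    """Run a single-layer tanh RNN over the frames; return one probability per frame."""
    hidden = [0.0] * len(w_rec)
    probs = []
    for x in frames:
        # The new hidden state depends on the current input and the previous state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
        logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
        probs.append(sigmoid(logit))
    return probs


rng = random.Random(0)
n_features, n_hidden, n_frames = 4, 3, 10  # assumed toy sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w_in = rand_matrix(n_hidden, n_features)
w_rec = rand_matrix(n_hidden, n_hidden)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
frames = rand_matrix(n_frames, n_features)  # stand-in for fused AE features

probs = rnn_speech_probabilities(frames, w_in, w_rec, w_out, b_out=0.0)
print(len(probs))
```

Because the hidden state is carried from frame to frame, how far back the network looks is determined by the trained weights rather than by a fixed window length.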

Experimental Results

Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al.

Our method produces fewer false alarms
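For readers unfamiliar with the term, a false alarm is a non-speech frame that the detector labels as speech. The sketch below shows one way to compute detection and false-alarm rates from per-frame probabilities; the threshold and the toy numbers are assumptions, not results from the slides.

```python
def detection_rates(probs, labels, threshold=0.5):
    """Return (probability of detection, probability of false alarm)."""
    decisions = [p >= threshold for p in probs]
    n_speech = sum(labels)
    n_non_speech = len(labels) - n_speech
    hits = sum(1 for d, y in zip(decisions, labels) if d and y == 1)
    false_alarms = sum(1 for d, y in zip(decisions, labels) if d and y == 0)
    pd = hits / n_speech if n_speech else 0.0
    pfa = false_alarms / n_non_speech if n_non_speech else 0.0
    return pd, pfa


# Toy data: 1 marks a speech frame, 0 a non-speech frame.
toy_probs = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
pd, pfa = detection_rates(toy_probs, toy_labels)
print(pd, pfa)
```

Sweeping the threshold from 0 to 1 traces out the trade-off between the two rates.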


Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.

Colored noise with 5 dB SNR and hammering transient


Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.

Babble noise with 10 dB SNR and keyboard transient

That’s nice, but still not end-to-end…

End-to-End VAD

Video Feature Extraction

▪ Residual networks

Audio Feature Extraction

▪ A WaveNet encoder

▪ Stacked residual blocks of dilated convolutions

▪ Captures long-range temporal dependencies


▪ Dilated convolutions
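To see why stacked dilated convolutions capture long-range dependencies, the sketch below stacks causal dilated convolutions and measures how far one input sample spreads. The causal form, the kernel size of 2, and the dilation schedule 1, 2, 4, 8 are assumptions chosen for illustration; the slides do not give the encoder's configuration.

```python
def dilated_causal_conv(signal, kernel, dilation):
    """y[n] = sum_k kernel[k] * signal[n - k * dilation], with zeros before the start."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = n - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [1, 2, 4, 8]          # assumed schedule, doubling per layer
y = [1.0] + [0.0] * 31            # unit impulse
for d in dilations:
    y = dilated_causal_conv(y, [1.0, 1.0], d)

span = sum(1 for v in y if v != 0.0)
print(span, receptive_field(2, dilations))
```

With doubling dilations the receptive field grows exponentially with depth, while the number of weights grows only linearly.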

Feature Fusion - MCB

▪ Feature vector fusion

[Figure labels: 2048, 2048, 2048]


▪ The best of all worlds – MCB

▪ Approximated by projecting the joint outer product to a lower-dimensional space, using a count sketch function
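The count sketch projection can be illustrated with a small example. It relies on a known property: the count sketch of an outer product equals the circular convolution of the two individual count sketches, so the full outer product never has to be formed. The sketch below uses toy dimensions (6 and 8 rather than 2048 and 1024), a direct circular convolution instead of the FFT used in practical implementations, and hypothetical function names.

```python
import random


def make_sketch(dim_in, dim_out, rng):
    """Random hash indices h and signs s that define one count sketch."""
    h = [rng.randrange(dim_out) for _ in range(dim_in)]
    s = [rng.choice((-1, 1)) for _ in range(dim_in)]
    return h, s


def count_sketch(x, h, s, dim_out):
    y = [0.0] * dim_out
    for i, v in enumerate(x):
        y[h[i]] += s[i] * v
    return y


def circular_convolution(a, b):
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out


def mcb(x, y, sketch_x, sketch_y, dim_out):
    """Compact bilinear fusion: convolve the two count sketches."""
    sx = count_sketch(x, sketch_x[0], sketch_x[1], dim_out)
    sy = count_sketch(y, sketch_y[0], sketch_y[1], dim_out)
    return circular_convolution(sx, sy)


def sketch_of_outer_product(x, y, sketch_x, sketch_y, dim_out):
    """Reference: sketch the explicit outer product entry by entry."""
    (hx, sx), (hy, sy) = sketch_x, sketch_y
    out = [0.0] * dim_out
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(hx[i] + hy[j]) % dim_out] += sx[i] * sy[j] * xi * yj
    return out


rng = random.Random(0)
dim_in, dim_out = 6, 8
audio_feat = [rng.uniform(-1, 1) for _ in range(dim_in)]  # stand-in features
video_feat = [rng.uniform(-1, 1) for _ in range(dim_in)]
sk_a = make_sketch(dim_in, dim_out, rng)
sk_v = make_sketch(dim_in, dim_out, rng)

fused = mcb(audio_feat, video_feat, sk_a, sk_v, dim_out)
reference = sketch_of_outer_product(audio_feat, video_feat, sk_a, sk_v, dim_out)
print(max(abs(a - b) for a, b in zip(fused, reference)))
```

The output length is `dim_out` regardless of the input sizes, which is what lets the joint vector size be chosen freely.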

Whatever we choose..


▪ Can easily be extended to more than two modalities

▪ We are able to choose the desired size of the joint vector

▪ MCB output size is set to be 1024

Dataset

▪ A more challenging dataset than in our previous work: each sample of the evaluation set contains a different mixture of background noise, transient, and SNR

▪ Training set – fresh noise is added at every iteration

▪ Evaluation set – noise is added once, at initialization
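The difference between the two noise policies can be sketched with a small dataset wrapper. This is a simplified illustration under stated assumptions: it adds plain Gaussian noise, whereas the slides describe mixtures of background noise, transients, and SNR values, and the class and parameter names are hypothetical.

```python
import random


class NoisyDataset:
    """Wraps clean samples and corrupts them on access or once at construction."""

    def __init__(self, clean_samples, noise_std, resample_noise, seed=0):
        self.clean = clean_samples
        self.noise_std = noise_std
        self.resample_noise = resample_noise
        self.rng = random.Random(seed)
        # Evaluation-style policy: corrupt every sample once and store the result.
        self.fixed = (
            None if resample_noise else [self._corrupt(x) for x in clean_samples]
        )

    def _corrupt(self, x):
        return [v + self.rng.gauss(0.0, self.noise_std) for v in x]

    def __len__(self):
        return len(self.clean)

    def __getitem__(self, index):
        if self.resample_noise:
            # Training-style policy: a new noise realization on every access.
            return self._corrupt(self.clean[index])
        return self.fixed[index]


clean_samples = [[0.0] * 5, [1.0] * 5]
train_set = NoisyDataset(clean_samples, noise_std=0.1, resample_noise=True)
eval_set = NoisyDataset(clean_samples, noise_std=0.1, resample_noise=False)

print(train_set[0] != train_set[0])  # noise differs between accesses
print(eval_set[0] == eval_set[0])    # noise is frozen
```

Drawing fresh noise during training acts as data augmentation, while the frozen evaluation set keeps reported results comparable between runs.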


Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and our previous work


A comparison of our four architectures: with MCB or concatenation, and with shared or joint LSTM

Shared LSTM + MCB is best

Discussion

▪ Features are learned from raw data

▪ Fusing the modalities via an MCB module explores higher-order relations between the two modalities

▪ Can be applied to other domains (e.g. ECG)

Questions? Thank you.
