![Page 1: Audio-visual processing of speech with DNN...Audio-visual processing of speech with DNN Ido Ariav Electrical Engineering Department Technion - Israel Institute of Technology Supervised](https://reader030.vdocuments.us/reader030/viewer/2022041103/5f0269947e708231d4042658/html5/thumbnails/1.jpg)
Audio-visual processing of speech with DNN
Ido Ariav
Electrical Engineering Department, Technion - Israel Institute of Technology
Supervised by Prof. Israel Cohen
Outline
▪ Background - Voice Activity Detection
▪ Deep Multimodal Architectures for Voice Activity Detection
▪ Results
Voice Activity Detection (VAD)
Some background...
Voice Activity Detection (VAD)
▪ Many applications - speech and speaker recognition, speech enhancement, dominant speaker identification, hearing-improvement devices, etc.
Voice Activity Detection (VAD)
▪ A preliminary block for other speech-related applications
Traditional Methods
▪ Simple acoustic features (e.g. zero-crossings) and model-based methods (e.g. GMM)
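To make the "simple acoustic features" idea concrete, here is a minimal sketch of a frame-level detector that combines frame energy with the zero-crossing rate. The frame length, hop size, and both thresholds are illustrative assumptions, not values from the talk, and the function names are hypothetical.

```python
import math

def frame_signal(x, frame_len, hop):
    """Split a list of samples into overlapping frames (no padding)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def energy_db(frame, eps=1e-12):
    """Mean frame energy in dB (eps avoids log of zero on silent frames)."""
    return 10.0 * math.log10(sum(s * s for s in frame) / len(frame) + eps)

def simple_vad(x, frame_len=160, hop=80, energy_thresh_db=-30.0, zcr_max=0.3):
    """Mark a frame as speech when it is energetic and has a low ZCR.

    Both thresholds are assumed values chosen for this toy example.
    """
    return [
        energy_db(f) > energy_thresh_db and zero_crossing_rate(f) < zcr_max
        for f in frame_signal(x, frame_len, hop)
    ]

# Toy signal: 800 samples of silence, then a 200 Hz tone sampled at 8 kHz.
fs = 8000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
decisions = simple_vad(silence + tone)
```

A fixed energy threshold like this one is exactly what breaks down under noise and transients: a loud keyboard click passes the energy test just as speech does.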
Traditional Methods
Performance deteriorates in presence of noise
Cannot model highly non-stationary noise (transients)
(Figure panels: -3 dB threshold, -4 dB threshold)
Deep NN
▪ Deep learning to the rescue!
Deep NN
▪ But wait… speech is a time series, so why should we treat it as a discrete classification problem?
Multimodal
▪ Are there any other sensors we could use?
▪ Video is especially useful in challenging acoustic environments
Deep Multimodal Architectures for Voice Activity Detection
Problem Setting
▪ A multimodal setting: audio and video signals are both available.
Problem Setting
▪ Stationary background noise and transients (metronome, keyboard typing, hammering) are added to the clean signal
▪ 11 speakers; each recording is 120 seconds long
(Figure panels: speech, transients)
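The contamination step described above can be sketched as follows: scale a noise recording so that the mixture reaches a target SNR, then overlay a short transient. This is only an illustration under assumptions; the talk does not specify how the mixing was implemented, and `mix_at_snr` and `add_transient` are hypothetical helper names.

```python
import math
import random

def power(x):
    """Mean power of a signal given as a list of samples."""
    return sum(s * s for s in x) / len(x)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add.

    Returns the noisy signal and the gain that was applied to the noise.
    """
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)], gain

def add_transient(signal, transient, start):
    """Overlay a short transient burst beginning at sample index `start`."""
    out = list(signal)
    for i, t in enumerate(transient):
        if start + i < len(out):
            out[start + i] += t
    return out

# Toy data: a sinusoid stands in for speech, white Gaussian noise for background.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

noisy, gain = mix_at_snr(clean, noise, snr_db=5.0)
achieved_snr_db = 10 * math.log10(power(clean) / power([gain * n for n in noise]))

# A five-sample decaying click stands in for a keyboard or hammer transient.
noisy = add_transient(noisy, [1.0, -0.8, 0.6, -0.4, 0.2], start=500)
```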
Deep architecture for VAD
Feature Extraction
▪ Audio Features - MFCC (Mel-frequency cepstral coefficients)
▪ Video Features - motion vectors (MV)
▪ MV capture both spatial and temporal information
Transient Reducing AE
▪ A dedicated autoencoder (AE) is designed both to fuse the audio and video signals and to reduce the effect of noise and transients
Recurrent Neural Network
▪ The transient reducing AE is followed by a multilayered RNN
▪ The length of the temporal window is learned instead of being arbitrarily predetermined.
▪ A sigmoid on the RNN output produces a probability of speech presence for each frame n
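The recurrence-plus-sigmoid idea can be sketched with a tiny forward pass. For brevity this uses a single-layer Elman RNN with hand-picked toy weights, whereas the slide describes a multilayered RNN fed by the AE output; every weight and dimension below is an assumption for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_vad_forward(frames, W_in, W_rec, b, w_out, b_out):
    """Run a single-layer Elman RNN over frames; return P(speech) per frame.

    The hidden state carries context from earlier frames, so each decision
    depends on the history rather than on the current frame alone.
    """
    hidden = [0.0] * len(b)
    probs = []
    for x in frames:
        new_hidden = []
        for j in range(len(b)):
            z = b[j]
            z += sum(W_in[j][i] * x[i] for i in range(len(x)))
            z += sum(W_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(z))
        hidden = new_hidden
        logit = b_out + sum(w * h for w, h in zip(w_out, hidden))
        probs.append(sigmoid(logit))  # probability of speech in this frame
    return probs

# Toy weights: 2 input features per frame, 2 hidden units.
W_in = [[0.5, -0.2], [0.1, 0.4]]
W_rec = [[0.3, 0.0], [0.0, 0.3]]
b = [0.0, 0.0]
w_out = [1.0, 1.0]
b_out = -0.5

frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
probs = rnn_vad_forward(frames, W_in, W_rec, b, w_out, b_out)
```

Thresholding `probs` (for example at 0.5) turns the per-frame probabilities into speech or non-speech decisions.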
Experimental Results
Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al.
Our method produces fewer false alarms
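As a reminder of how such comparisons are usually scored, the sketch below computes the detection rate and false-alarm rate from per-frame probabilities at a given threshold. The threshold and the toy data are assumptions; the talk does not state the exact evaluation protocol.

```python
def detection_and_false_alarm(probs, labels, threshold=0.5):
    """Return (detection rate, false-alarm rate) for binary frame labels.

    labels: 1 for speech frames, 0 for non-speech frames.
    """
    decisions = [p >= threshold for p in probs]
    true_pos = sum(1 for d, y in zip(decisions, labels) if d and y == 1)
    false_pos = sum(1 for d, y in zip(decisions, labels) if d and y == 0)
    n_speech = sum(labels)
    n_nonspeech = len(labels) - n_speech
    p_detect = true_pos / n_speech if n_speech else 0.0
    p_false_alarm = false_pos / n_nonspeech if n_nonspeech else 0.0
    return p_detect, p_false_alarm

# Toy example: three speech frames followed by three non-speech frames.
probs = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
labels = [1, 1, 1, 0, 0, 0]
p_detect, p_false_alarm = detection_and_false_alarm(probs, labels)
```

Sweeping `threshold` from 0 to 1 and plotting the resulting pairs traces out an ROC curve for the detector.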
Experimental Results
Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.
Colored noise with 5 dB SNR and hammering transient
Experimental Results
Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.
Babble noise with 10 dB SNR and keyboard transient
That’s nice, but still not end-to-end…
End-to-End VAD
Video Feature Extraction
▪ Residual networks
Audio Feature Extraction
▪ A WaveNet encoder
▪ Stacked residual blocks of dilated convolutions
▪ Captures long-range temporal dependencies
Audio Feature Extraction
▪ Dilated convolutions
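A minimal sketch of a one-dimensional dilated convolution shows why stacking such layers reaches far back in time. The causal, zero-padded form used here is an assumption made for simplicity; the encoder in the talk may use a different padding scheme, and the kernel size and dilation schedule are illustrative.

```python
def dilated_causal_conv1d(x, kernel, dilation):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zeros (causal zero padding).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# With kernel [1, 1] and dilation 2, each output is x[t] + x[t - 2].
y = dilated_causal_conv1d(x, kernel=[1.0, 1.0], dilation=2)

# Doubling the dilation per layer grows the receptive field exponentially.
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

Four layers with kernel size 2 already cover 16 samples, while four undilated layers of the same size would cover only 5.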
Audio Feature Extraction
Feature Fusion - MCB
▪ Fusion of the feature vectors
(Figure labels: 2048, 2048, 2048)
Feature Fusion - MCB
▪ The best of all worlds – multimodal compact bilinear pooling (MCB)
▪ Approximated by projecting the joint outer product to a lower-dimensional space, using a count sketch function
Whatever we choose..
Feature Fusion - MCB
Feature Fusion - MCB
▪ Can easily be extended to more than two modalities
▪ The size of the joint vector can be chosen freely
▪ MCB output size is set to 1024
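The count sketch projection behind MCB can be illustrated on toy vectors. The sketch below projects each modality with its own random hash and sign vectors, then combines the two sketches by circular convolution; it also checks that this equals the count sketch of the full outer product, which is the identity that makes MCB cheap. The dimensions (6 inputs, 4 outputs) are toy assumptions standing in for the 2048 and 1024 on the slides, and practical implementations typically compute the convolution with FFTs rather than the direct double loop used here.

```python
import random

def make_sketch_params(n, d, rng):
    """Draw a random hash h: [0, n) -> [0, d) and random signs in {-1, +1}."""
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s

def count_sketch(v, h, s, d):
    """Project v to d dimensions: out[h[i]] += s[i] * v[i]."""
    out = [0.0] * d
    for i, value in enumerate(v):
        out[h[i]] += s[i] * value
    return out

def circular_convolution(a, b):
    """Direct O(d^2) circular convolution of two length-d vectors."""
    d = len(a)
    out = [0.0] * d
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out

def mcb(x, y, params_x, params_y, d):
    """Compact bilinear fusion: convolve the count sketches of x and y."""
    return circular_convolution(
        count_sketch(x, *params_x, d), count_sketch(y, *params_y, d)
    )

def sketch_of_outer_product(x, y, params_x, params_y, d):
    """Reference: count sketch applied to the explicit outer product of x and y."""
    (hx, sx), (hy, sy) = params_x, params_y
    out = [0.0] * d
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(hx[i] + hy[j]) % d] += sx[i] * sy[j] * xi * yj
    return out

rng = random.Random(0)
n, d = 6, 4  # toy sizes; the slides use 2048-dimensional inputs and d = 1024
audio = [rng.uniform(-1, 1) for _ in range(n)]
video = [rng.uniform(-1, 1) for _ in range(n)]
params_a = make_sketch_params(n, d, rng)
params_v = make_sketch_params(n, d, rng)

fused = mcb(audio, video, params_a, params_v, d)
reference = sketch_of_outer_product(audio, video, params_a, params_v, d)
```

Because `d` is a free parameter of the hash, the size of the joint vector is a design choice rather than a consequence of the input sizes.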
Dataset
▪ A more challenging dataset than in our previous work: each sample of the evaluation set contains a different mixture of background noise, transient, and SNR
▪ Training set – noise is re-mixed at every iteration
▪ Evaluation set – noise is mixed once, at initialization
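The difference between the two noising schedules can be sketched with a small dataset wrapper. Gaussian noise stands in here for the real mixture of background noise, transient, and SNR, and the class name and its parameters are hypothetical.

```python
import random

class NoisyDataset:
    """Wrap clean samples and corrupt them either once or on every access."""

    def __init__(self, clean_samples, noise_std, resample_noise, seed=0):
        self.clean = clean_samples
        self.noise_std = noise_std
        self.resample_noise = resample_noise
        self.rng = random.Random(seed)
        # Evaluation mode: corrupt every sample once, at construction time.
        self.fixed = None
        if not resample_noise:
            self.fixed = [self._corrupt(c) for c in clean_samples]

    def _corrupt(self, sample):
        # Stand-in corruption: additive Gaussian noise (an assumption).
        return [v + self.rng.gauss(0.0, self.noise_std) for v in sample]

    def __len__(self):
        return len(self.clean)

    def __getitem__(self, idx):
        if self.resample_noise:
            # Training mode: a fresh corruption each time the sample is drawn.
            return self._corrupt(self.clean[idx])
        return self.fixed[idx]

clean = [[0.0, 1.0, 0.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
train_set = NoisyDataset(clean, noise_std=0.1, resample_noise=True)
eval_set = NoisyDataset(clean, noise_std=0.1, resample_noise=False)

train_first, train_second = train_set[0], train_set[0]
eval_first, eval_second = eval_set[0], eval_set[0]
```

Re-noising during training acts as data augmentation, while the fixed evaluation set keeps scores comparable across runs.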
Experimental Results
Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and our previous work
Experimental Results
A comparison of our four architectures: with MCB or concatenation, and with a shared or joint LSTM
Shared LSTM + MCB is best
Discussion
▪ Features are learned from raw data
▪ The modalities are fused via an MCB module, which explores higher-order relations between the two modalities
▪ Can be applied to other domains (e.g. ECG)
Questions?
Thank you.