![Page 1: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/1.jpg)
Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and
Robust Speech Recognition
John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex Acero: Microsoft Research
With help from Zicheng Liu and Asela Gunawardana
![Page 2: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/2.jpg)
Bone Sensor for Robust Speech Recognition
• Small amounts of interfering noise are disasterous for ASR.
• Standard air microphones are susceptible to noise.
• A bone sensor is more resistant to noise, but produces distortion.
• There is not enough data yet to train speech recognition using the bone sensor.
• Enhancement that fuses bone and air sensors allows us to use existing speech recognizers
![Page 3: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/3.jpg)
Sensor Platform
![Page 4: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/4.jpg)
Noise Suppression in Bone Sensor
AudioAudio
Bone signalBone signal
speechspeech
non-speechnon-speechspeechspeech
![Page 5: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/5.jpg)
Relationship between air and bone signals
![Page 6: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/6.jpg)
Articulation-Dependent Relationship Between Air and Bone Signals
Air
Bone
![Page 7: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/7.jpg)
High Resolution Witty Enhancement
High-resolution log spectrum takes advantage of harmonics.
![Page 8: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/8.jpg)
Speech Model
Adaptive noise model: (handful of states)
Pre-trained speech model:
(hundreds of states)
![Page 9: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/9.jpg)
Noisy Speech Model
Speech State
Air Signal
Bone Signal
Noisy Air Mic
Noisy Bone Mic
Noise States
Speech Model
Trained on small speech database
Noisy Speech Model
Adapted at test time
![Page 10: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/10.jpg)
Sensor Model
Frequency domain combination of signal and noise:
Relationship between magnitudes:
Relationship between log spectra:
and
is treated as probabilistic error
Model distribution over sensors is thus:
where
![Page 11: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/11.jpg)
Inference
Speech posterior:
This is intractable due to nonlinearity. Algonquin: Iterate Laplace method. Gaussian estimate of posterior of speech and noise for each combination of states,
with mean and variance
and state posterior, (see paper for details)
Compute posterior speech mean:
Then resynthesize using lapped transform and noisy phases.
![Page 12: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/12.jpg)
Adaptation
Since noise is unpredictable, we adapt the noise model at inference time, keeping the speech model constant.
The generalized EM algorithm yields the following update equations (see paper for details). The averages over time <>t can be done either in batch or online.
Here, is the posterior state probability, , and are the posterior mean and covariance of the noise given the state.
![Page 13: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/13.jpg)
Air Signal
![Page 14: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/14.jpg)
Bone Signal
![Page 15: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/15.jpg)
Enhanced Signal
![Page 16: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/16.jpg)
Evaluation
• Four subjects reading 41 sentences each of the Wall Street Journal.
• Speaker-dependent models of clean speech training set
• Test set: Same sentences read in same environment with interfering male speaker
• Recognizer: 500K vocabulary (Microsoft)
![Page 17: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/17.jpg)
Parameters
• Audio was 16-bit, 16kHz
• Frames were 50ms with 20ms frame shift.
• Frequency resolution was 256 bins.
• Speech models had 512 states
• Noise models had 2 states
• Noisy log spectra were temporally smoothed prior to processing to reduce jitter.
![Page 18: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/18.jpg)
Test Cases
• Sensor conditions: – Audio only speech model (A)– Audio plus Bone speech model (AB)– AB with state posteriors inferred from bone
signal only (AB+)
• Adaptation conditions– Use speech detection on bone signal to train
noise model (Detect)– Use EM algorithm to adapt noise model (Adapt)
![Page 19: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/19.jpg)
Results
Adaptation Helps
Inferring states from bone mic only
Inferring states from both bone and air mics
Inference only using air mic
Air 57.2%
Bone 28.4%
Original Noisy Signals
Clean: 92.7%
![Page 20: Model-Based Fusion of Bone and Air Sensors for Speech Enhancement and Robust Speech Recognition John Hershey, Trausti Kristjansson, Zhengyou Zhang, Alex](https://reader033.vdocuments.us/reader033/viewer/2022051516/56649d365503460f94a0e513/html5/thumbnails/20.jpg)
Future Directions
• We have shown a strong potential for model-based noise adaptation in concert with a bone sensor.
• Improve the speed of such systems by employing sequential approximations.
• Incorporate state-conditional correlations between air and bone signals.
• Deal with phase dependencies.