![Page 1: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/1.jpg)
Spatio-Temporal Analysis of Multimodal Speaker Activity
Guillaume Lathoud, IDIAP
Supervised by Dr Iain McCowan, IDIAP
![Page 4: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/4.jpg)
Context
• Spontaneous multi-party speech.
• Goal: extract salient information:
  – Who? What? When? Where?
  – Automatic meeting annotation/transcription.
  – Speaker tracking, speech acquisition.
  – Surveillance.
• Approach: based on speaker location.
• Multisource problem (overlaps, noise).
![Page 5: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/5.jpg)
![Page 6: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/6.jpg)
How?
• Audio location: microphone array.
• Audio content: speaker identification.
• Video: one or several cameras.
• Combination.
![Page 10: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/10.jpg)
Audio: Global Picture
Multiple waveforms
Resolution 0.0000625 s
Task 1: data acquisition and annotation (complete)
Task 2: develop robust multisource strategies (in progress)
Active speakers’ locations
Resolution 0.016 s
Link locations across space and time
Resolution 0.016 s
Links up to 0.25 s
Task 3: joint segmentation and tracking (complete)
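The three time scales on this slide can be related with a little arithmetic. The sketch below only uses the numbers quoted on the slide; the variable names are illustrative, and reading the waveform resolution as a sampling period is an assumption.

```python
from fractions import Fraction

# Values quoted on the slide (exact decimal arithmetic via Fraction).
SAMPLE_PERIOD_S = Fraction("0.0000625")  # waveform resolution (assumed to be the sampling period)
FRAME_PERIOD_S = Fraction("0.016")       # resolution of the location estimates
MAX_LINK_S = Fraction("0.25")            # longest link across time

sampling_rate_hz = 1 / SAMPLE_PERIOD_S                # samples per second
samples_per_frame = FRAME_PERIOD_S / SAMPLE_PERIOD_S  # waveform samples per location frame
frames_per_link = MAX_LINK_S / FRAME_PERIOD_S         # location frames spanned by the longest link

print(sampling_rate_hz)        # 16000
print(samples_per_frame)       # 256
print(float(frames_per_link))  # 15.625
```

Under that reading, each location estimate summarizes 256 waveform samples, and a link can bridge about 15 location frames.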
![Page 13: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/13.jpg)
Task 1: AV16.3 corpus
• At IDIAP: 16 microphones, 3 cameras.
• 40 short recordings, about 1h30 overall.
  – ‘meeting’: seated.
  – ‘surveillance’: standing.
  – pathological test cases (A, V, AV).
• 3D mouth annotation.
• Used in the AMI project.
• http://mmm.idiap.ch/Lathoud/av16.3_v6
![Page 14: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/14.jpg)
AV16.3 corpus
![Page 15: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/15.jpg)
![Page 18: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/18.jpg)
Sector-based Approach
Question: is there at least one active source in a given sector?
![Page 20: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/20.jpg)
Task 2: Multisource Localization
• Problem:
  – Detect: how many speakers?
  – Localize: where?
• Sectors (coarse-to-fine).
• Tested on real data: AV16.3 corpus.
• To do:
  – Finalize (optimization, multi-level).
  – Compare with existing methods.
![Page 21: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/21.jpg)
Task 2: Single Speaker Example
![Page 24: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/24.jpg)
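Task 2: Multiple Loudspeakers

| Condition | Metric | Ideal | Result |
|---|---|---|---|
| 2 loudspeakers simultaneously active | >=1 detected | 100% | 100% |
| 2 loudspeakers simultaneously active | Average number detected | 2.0 | 1.9 |
| 3 loudspeakers simultaneously active | >=1 detected | 100% | 99.8% |
| 3 loudspeakers simultaneously active | Average number detected | 3.0 | 2.5 |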
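The slides do not spell out how the two metrics are computed. A minimal sketch, assuming each is computed over time frames from the number of sources detected in each frame (the function name and the sample counts are hypothetical):

```python
def detection_metrics(counts):
    """Summarize per-frame detection counts.

    counts: number of sources detected in each time frame.
    Returns (percentage of frames with at least one detection,
             average number of detected sources per frame).
    """
    n = len(counts)
    at_least_one_pct = 100.0 * sum(1 for c in counts if c >= 1) / n
    average_detected = sum(counts) / n
    return at_least_one_pct, average_detected


# Hypothetical counts for ten frames with 3 sources active (not data from the talk).
example_counts = [3, 2, 3, 2, 3, 2, 3, 2, 0, 3]
pct, avg = detection_metrics(example_counts)
print(f">=1 detected: {pct:.1f}%  average number detected: {avg:.1f}")
```

With this reading, the "Ideal" column is what a perfect detector would score: 100% and the true number of simultaneously active sources.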
![Page 26: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/26.jpg)
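Real data: Humans

| Condition | Metric | Ideal | Result |
|---|---|---|---|
| 2 speakers simultaneously active (includes short silences) | >=1 detected | ~89.4% | 90.8% |
| 2 speakers simultaneously active (includes short silences) | Average number detected | ~1.3 | 1.3 |
| 3 speakers simultaneously active (includes short silences) | >=1 detected | ~96.5% | 95.1% |
| 3 speakers simultaneously active (includes short silences) | Average number detected | ~2.0 | 1.6 |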
![Page 30: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/30.jpg)
Task 3: Segmentation/Tracking
• Speech:
  – Short and sporadic utterances.
  – Overlaps.
  – Filtering (Kalman, particle filter) is difficult.
• Alternative: short-term clustering.
• Short-term = 0.25 s.
• Threshold-free, online, unsupervised.
• Unknown number of objects.
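To give a feel for short-term linking with an unknown number of objects, here is a deliberately simplified sketch. It is not the method of the talk: it is a batch single-linkage grouping that relies on a fixed angular gate (`GATE_DEG`, an assumption), whereas the proposed approach is online and threshold-free. Only the 0.25 s horizon comes from the slides; the input format of (time, azimuth) pairs is also assumed, and azimuth wrap-around is ignored.

```python
from itertools import combinations

MAX_LINK_S = 0.25  # from the slides: links span at most 0.25 s
GATE_DEG = 10.0    # ASSUMPTION: fixed angular gate, not part of the talk's method


def short_term_clusters(points, max_link_s=MAX_LINK_S, gate_deg=GATE_DEG):
    """Group (time_s, azimuth_deg) estimates by single linkage (union-find).

    Two estimates are linked when they are close in both time and azimuth;
    connected groups of links form the clusters, so their number is not fixed
    in advance.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        (t1, a1), (t2, a2) = points[i], points[j]
        if abs(t1 - t2) <= max_link_s and abs(a1 - a2) <= gate_deg:
            parent[find(i)] = find(j)

    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return sorted(clusters.values())


# Hypothetical estimates: two overlapping talkers, then a later isolated utterance.
estimates = [(0.000, 30.0), (0.016, 31.0), (0.032, 120.0),
             (0.048, 29.5), (0.064, 121.0), (1.000, 30.5)]
for cluster in short_term_clusters(estimates):
    print(cluster)
```

The last estimate ends up in its own cluster even though its azimuth matches the first talker, because it lies more than 0.25 s away from every other estimate.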
![Page 31: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/31.jpg)
Example: iteration 1 (partition)
![Page 32: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/32.jpg)
Example: iteration 1 (merge)
![Page 33: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/33.jpg)
Example: iteration 2 (partition)
![Page 34: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/34.jpg)
Example: iteration 2 (merge)
![Page 35: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/35.jpg)
Example: iteration 3 (partition)
![Page 36: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/36.jpg)
Example: iteration 3 (merge)
![Page 37: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/37.jpg)
Example: iteration 4 (partition)
![Page 38: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/38.jpg)
Example: iteration 4 (merge)
![Page 39: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/39.jpg)
Example: result
![Page 40: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/40.jpg)
Example: result
![Page 41: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/41.jpg)
Example: result
![Page 42: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/42.jpg)
Task 3: Application
Link locations across space and time
Resolution 0.016 s
Links up to 0.25 s
Task 3: joint segmentation and tracking (complete)
Annotated IDIAP corpus of short meetings (total 1h45) http://mmm.idiap.ch
Single source localization
![Page 43: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/43.jpg)
Application (2)
![Page 46: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/46.jpg)
Task 3: Metrics
• Precision (PRC):
  – An active speaker is detected in the result.
  – PRC = probability that the speaker is truly active.
• Recall (RCL):
  – A speaker is truly active.
  – RCL = probability of detecting the speaker in the result.
• F-measure: $F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}}$
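A minimal sketch of these three metrics, assuming that both the result and the ground truth are available as sets of (frame index, speaker) pairs. That representation, the function name, and the sample data are assumptions for illustration.

```python
def precision_recall_f(detected, truth):
    """Frame-level precision, recall, and F-measure.

    detected, truth: sets of (frame_index, speaker_id) pairs marking which
    speaker is reported / truly active in which frame.
    """
    hits = len(detected & truth)                   # correct detections
    prc = hits / len(detected) if detected else 0.0
    rcl = hits / len(truth) if truth else 0.0
    f = 2 * prc * rcl / (prc + rcl) if prc + rcl > 0 else 0.0
    return prc, rcl, f


# Hypothetical example: one false alarm (3, "B") and one miss (1, "B").
truth = {(0, "A"), (1, "A"), (1, "B"), (2, "B")}
detected = {(0, "A"), (1, "A"), (2, "B"), (3, "B")}
print(precision_recall_f(detected, truth))  # (0.75, 0.75, 0.75)
```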
![Page 48: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/48.jpg)
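Task 3: Results

Entire data:

| Metric | Proposed | Lapel baseline |
|---|---|---|
| PRC | 79.7% | 84.3% |
| RCL | 94.6% | 93.3% |
| F | 86.5% | 88.6% |

Overlaps only:

| Metric | Proposed | Lapel baseline |
|---|---|---|
| PRC | 55.4% | 46.6% |
| RCL | 84.8% | 66.4% |
| F | 67.0% | 54.7% |

$F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}}$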
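As a quick consistency check, the F values can be recomputed from the PRC and RCL figures reported on this slide. The dictionary layout and names below are illustrative only.

```python
def f_measure(prc, rcl):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * prc * rcl / (prc + rcl)


# (PRC, RCL, reported F) in percent, copied from the slide.
reported = {
    ("entire data", "proposed"): (79.7, 94.6, 86.5),
    ("entire data", "lapel baseline"): (84.3, 93.3, 88.6),
    ("overlaps only", "proposed"): (55.4, 84.8, 67.0),
    ("overlaps only", "lapel baseline"): (46.6, 66.4, 54.7),
}

for (subset, system), (prc, rcl, f_reported) in reported.items():
    f = f_measure(prc, rcl)
    print(f"{subset:13s} {system:14s} recomputed F = {f:.2f}  reported F = {f_reported}")
```

The recomputed values match the reported ones to within 0.1 point; differences of that size are expected because the PRC and RCL inputs are themselves rounded.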
![Page 49: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/49.jpg)
Conclusion
• Spontaneous speech = multisource problem.
• AV16.3 corpus recorded, annotated.
• Approach: detect, localize, track, segment.
• Location is not identity!
  – Fusion with monochannel analysis.
  – Fusion with video.
![Page 50: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/50.jpg)
Thank you!
![Page 51: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/51.jpg)
Detection: Energy and Localization
![Page 52: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/52.jpg)
Task 2: Delay-sum vs Proposed (1/3)
With optimized centroids (this work)
With delay-sum centroids (this work)
![Page 53: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/53.jpg)
Task 2: Delay-sum vs Proposed (2/3)
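| Condition | Metric | Ideal | Delay-sum | Proposed |
|---|---|---|---|---|
| 2 loudspeakers simultaneously active | >=1 detected | 100% | 99.9% | 100% |
| 2 loudspeakers simultaneously active | Average number detected | 2.0 | 1.8 | 1.9 |
| 3 loudspeakers simultaneously active | >=1 detected | 100% | 99.2% | 99.8% |
| 3 loudspeakers simultaneously active | Average number detected | 3.0 | 1.9 | 2.5 |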
![Page 54: Spatio-Temporal Analysis of Multimodal Speaker Activity](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812a47550346895d8d83f7/html5/thumbnails/54.jpg)
Task 2: Delay-sum vs Proposed (3/3)
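| Condition | Metric | Ideal | Delay-sum | Proposed |
|---|---|---|---|---|
| 2 humans simultaneously active | >=1 detected | ~89.4% | 80.0% | 90.8% |
| 2 humans simultaneously active | Average number detected | ~1.3 | 1.0 | 1.3 |
| 3 humans simultaneously active | >=1 detected | ~96.5% | 86.7% | 95.1% |
| 3 humans simultaneously active | Average number detected | ~2.0 | 1.4 | 1.6 |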