Spatio-Temporal Analysis of Multimodal Speaker Activity

Guillaume Lathoud, IDIAP
Supervised by Dr Iain McCowan, IDIAP



Page 1: Spatio-Temporal Analysis of Multimodal Speaker Activity

Spatio-Temporal Analysis of Multimodal Speaker Activity

Guillaume Lathoud, IDIAP

Supervised by Dr Iain McCowan, IDIAP

Page 4: Spatio-Temporal Analysis of Multimodal Speaker Activity

Context

• Spontaneous multi-party speech.

• Goal: extract salient information:
  – Who? What? When? Where?
  – Automatic meeting annotation/transcription.
  – Speaker tracking, speech acquisition.
  – Surveillance.

• Approach: based on speaker location.

• Multisource problem (overlaps, noise).

Page 6: Spatio-Temporal Analysis of Multimodal Speaker Activity

How?

• Audio location: microphone array.

• Audio content: speaker identification.

• Video: one or several cameras.

• Combination.

Page 10: Spatio-Temporal Analysis of Multimodal Speaker Activity

Audio: Global Picture

• Multiple waveforms (resolution 0.0000625 s).
  – Task 1: data acquisition and annotation (complete).

• Active speakers’ locations (resolution 0.016 s).
  – Task 2: develop robust multisource strategies (in progress).

• Link locations across space and time (resolution 0.016 s, links up to 0.25 s).
  – Task 3: joint segmentation and tracking (complete).
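The three time scales on this slide can be related with a little arithmetic. The sketch below only restates the slide's numbers; the variable names are illustrative, and reading 0.0000625 s as a sampling period and 0.016 s as a frame period is an assumption that the slide does not state explicitly.

```python
# Time scales quoted on the slide (seconds).
SAMPLE_PERIOD_S = 0.0000625   # resolution of the waveforms
FRAME_PERIOD_S = 0.016        # resolution of the location estimates
LINK_SPAN_S = 0.25            # maximum span of a link between locations

# Assumption: the waveform resolution is the sampling period.
sample_rate_hz = 1.0 / SAMPLE_PERIOD_S

# Assumption: one location estimate is produced per frame of waveform samples.
samples_per_frame = FRAME_PERIOD_S / SAMPLE_PERIOD_S

# Number of location frames that a single link can span.
frames_per_link = LINK_SPAN_S / FRAME_PERIOD_S

print(f"sampling rate     : {sample_rate_hz:.0f} Hz")
print(f"samples per frame : {samples_per_frame:.0f}")
print(f"frames per link   : {frames_per_link:.3f}")
```

Under these assumptions the waveforms are sampled at 16 kHz, each location estimate covers 256 samples, and a link can bridge a little under 16 consecutive frames.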

Page 13: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 1: AV16.3 corpus

• At IDIAP: 16 microphones, 3 cameras.

• 40 short recordings, about 1h30 overall.
  – ‘meeting’: seated.
  – ‘surveillance’: standing.
  – pathological test cases (A, V, AV).

• 3D mouth annotation.

• Used in the AMI project.

• http://mmm.idiap.ch/Lathoud/av16.3_v6

Page 14: Spatio-Temporal Analysis of Multimodal Speaker Activity

AV16.3 corpus

Page 17: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Multisource Localization

• Problem:
  – Detect: how many speakers?
  – Localize: where?

Page 18: Spatio-Temporal Analysis of Multimodal Speaker Activity

Sector-based Approach

Question: is there at least one active source in a given sector?
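One way to picture the sector question is a delay-sum scan that steers the array at the centroid of each sector and compares the resulting powers. The sketch below is a simplified delay-sum illustration, not the method proposed in the talk. The array geometry (8 microphones on a 10 cm circle), the 12 azimuth sectors, the frequency grid, the speed of sound and all function names are assumptions made for the example, and the input is a simulated, noise-free far-field source.

```python
import cmath
import math
import random

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def circular_array(num_mics=8, radius=0.1):
    """Hypothetical planar circular array: (x, y) positions in metres."""
    return [(radius * math.cos(2 * math.pi * m / num_mics),
             radius * math.sin(2 * math.pi * m / num_mics))
            for m in range(num_mics)]

def arrival_delays(mics, azimuth):
    """Far-field arrival delay (s) at each microphone, relative to the centre."""
    ux, uy = math.cos(azimuth), math.sin(azimuth)
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mics]

def simulate_frame(mics, azimuth, freqs, rng):
    """Noise-free spectra frame[m][k] for one far-field source at `azimuth`."""
    source = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in freqs]
    delays = arrival_delays(mics, azimuth)
    return [[s * cmath.exp(-2j * math.pi * f * d) for s, f in zip(source, freqs)]
            for d in delays]

def sector_centroids(num_sectors):
    """Centre azimuth (rad) of each of `num_sectors` equal sectors."""
    return [2 * math.pi * (s + 0.5) / num_sectors for s in range(num_sectors)]

def sector_powers(frame, mics, freqs, num_sectors=12):
    """Delay-sum power steered at the centroid of each azimuth sector."""
    powers = []
    for azimuth in sector_centroids(num_sectors):
        delays = arrival_delays(mics, azimuth)
        total = 0.0
        for k, f in enumerate(freqs):
            # Undo the propagation delays for this steering direction and sum.
            beam = sum(frame[m][k] * cmath.exp(2j * math.pi * f * delays[m])
                       for m in range(len(mics)))
            total += abs(beam) ** 2
        powers.append(total)
    return powers

# Usage: place one simulated source at the centroid of sector 4.
mics = circular_array()
freqs = [300.0 + 50.0 * k for k in range(55)]  # 300 Hz to 3 kHz
rng = random.Random(0)
frame = simulate_frame(mics, sector_centroids(12)[4], freqs, rng)
powers = sector_powers(frame, mics, freqs)
best_sector = max(range(len(powers)), key=powers.__getitem__)
print("sector with the highest delay-sum power:", best_sector)
```

The sector holding the source receives the largest steered power, because the microphone signals add coherently only in that direction. Turning these powers into a yes/no answer for every sector requires a decision rule, which the slides do not specify.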

Page 20: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Multisource Localization

• Problem:
  – Detect: how many speakers?
  – Localize: where?

• Sectors (coarse-to-fine).

• Tested on real data: AV16.3 corpus.

• To do:
  – Finalize (optimization, multi-level).
  – Compare with existing methods.

Page 21: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Single Speaker Example

Page 24: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Multiple Loudspeakers

2 loudspeakers simultaneously active

Metric                 Ideal    Result
>=1 detected           100%     100%
Average nb detected    2.0      1.9

3 loudspeakers simultaneously active

Metric                 Ideal    Result
>=1 detected           100%     99.8%
Average nb detected    3.0      2.5
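The two metrics in these tables can be read as simple per-frame statistics. The sketch below assumes that each time frame yields a count of correctly detected sources, that ">=1 detected" is the fraction of frames with at least one detection, and that "Average nb detected" is the mean count. This reading, the function name and the example counts are assumptions, since the slides do not define the metrics in detail.

```python
def detection_stats(detected_per_frame):
    """Return (fraction of frames with >= 1 detection, mean number detected)."""
    num_frames = len(detected_per_frame)
    if num_frames == 0:
        raise ValueError("at least one frame is required")
    at_least_one = sum(1 for count in detected_per_frame if count >= 1) / num_frames
    average = sum(detected_per_frame) / num_frames
    return at_least_one, average

# Made-up counts for ten frames with two simultaneously active sources.
counts = [2, 2, 1, 2, 0, 2, 2, 1, 2, 2]
at_least_one, average = detection_stats(counts)
print(f">=1 detected        : {100 * at_least_one:.1f}%")
print(f"Average nb detected : {average:.1f}")
```

With two sources that are always active, the ideal values would be 100% and 2.0, which matches the "Ideal" column for the loudspeaker recordings.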

Page 26: Spatio-Temporal Analysis of Multimodal Speaker Activity

Real data: Humans

2 speakers simultaneously active (includes short silences)

Metric                 Ideal     Result
>=1 detected           ~89.4%    90.8%
Average nb detected    ~1.3      1.3

3 speakers simultaneously active (includes short silences)

Metric                 Ideal     Result
>=1 detected           ~96.5%    95.1%
Average nb detected    ~2.0      1.6

Page 30: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 3: Segmentation/Tracking

• Speech:
  – Short and sporadic utterances.
  – Overlaps.
  – Filtering is difficult (Kalman, particle filters).

• Alternative: short-term clustering.

• Short-term = 0.25 s.

• Threshold-free, online, unsupervised.

• Unknown number of objects.

Page 31: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 1 (partition)

Page 32: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 1 (merge)

Page 33: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 2 (partition)

Page 34: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 2 (merge)

Page 35: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 3 (partition)

Page 36: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 3 (merge)

Page 37: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 4 (partition)

Page 38: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: iteration 4 (merge)

Page 39: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: result

Page 40: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: result

Page 41: Spatio-Temporal Analysis of Multimodal Speaker Activity

Example: result

Page 42: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 3: Application

• Single source localization.

• Link locations across space and time (resolution 0.016 s, links up to 0.25 s).

• Task 3: joint segmentation and tracking (complete).

• Annotated IDIAP corpus of short meetings (total 1h45): http://mmm.idiap.ch

Page 43: Spatio-Temporal Analysis of Multimodal Speaker Activity

Application (2)

Page 46: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 3: Metrics

• Precision (PRC):
  – An active speaker is detected in the result.
  – PRC = probability that this speaker is truly active.

• Recall (RCL):
  – A speaker is truly active.
  – RCL = probability that this speaker is detected in the result.

• F-measure: F = (2 * PRC * RCL) / (PRC + RCL)
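The definitions above can be made concrete with a small frame-level computation. The sketch below assumes that both the system output and the ground truth are given as sets of (frame index, speaker) pairs. That representation, the function name and the example data are illustrative assumptions, not the evaluation code used in the talk.

```python
def precision_recall_f(detected, truth):
    """Compute (PRC, RCL, F) from sets of (frame_index, speaker) pairs."""
    detected, truth = set(detected), set(truth)
    hits = len(detected & truth)                       # detected and truly active
    prc = hits / len(detected) if detected else 0.0    # P(truly active | detected)
    rcl = hits / len(truth) if truth else 0.0          # P(detected | truly active)
    f = 2 * prc * rcl / (prc + rcl) if prc + rcl > 0 else 0.0
    return prc, rcl, f

# Made-up example: speakers A and B over four frames.
truth = {(0, "A"), (1, "A"), (1, "B"), (2, "B")}
detected = {(0, "A"), (1, "A"), (2, "B"), (2, "A"), (3, "A")}
prc, rcl, f = precision_recall_f(detected, truth)
print(f"PRC = {prc:.3f}, RCL = {rcl:.3f}, F = {f:.3f}")
```

In this example three of the five detections are correct (PRC = 0.6) and three of the four truly active pairs are found (RCL = 0.75).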

Page 48: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 3: Results

Entire data:

Metric    Proposed    Lapel baseline
PRC       79.7%       84.3%
RCL       94.6%       93.3%
F         86.5%       88.6%

Overlaps only:

Metric    Proposed    Lapel baseline
PRC       55.4%       46.6%
RCL       84.8%       66.4%
F         67.0%       54.7%

F = (2 * PRC * RCL) / (PRC + RCL)
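As a quick consistency check, the F values can be recomputed from the PRC and RCL figures on this slide. The numbers below are copied from the tables; the dictionary layout and names are illustrative only.

```python
def f_measure(prc, rcl):
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    return 2 * prc * rcl / (prc + rcl)

# (PRC, RCL, reported F) in percent, as given on the slide.
reported = {
    ("entire data", "proposed"): (79.7, 94.6, 86.5),
    ("entire data", "lapel baseline"): (84.3, 93.3, 88.6),
    ("overlaps only", "proposed"): (55.4, 84.8, 67.0),
    ("overlaps only", "lapel baseline"): (46.6, 66.4, 54.7),
}

recomputed = {key: f_measure(prc, rcl) for key, (prc, rcl, _) in reported.items()}
for key, value in recomputed.items():
    print(f"{key[0]:14s} {key[1]:15s} F = {value:.2f}% (slide: {reported[key][2]}%)")
```

The recomputed values agree with the slide to within 0.1 percentage points. Small differences are expected because PRC and RCL are themselves rounded to one decimal.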

Page 49: Spatio-Temporal Analysis of Multimodal Speaker Activity

Conclusion

• Spontaneous speech = multisource problem.

• AV16.3 corpus recorded, annotated.

• Approach: detect, localize, track, segment.

• Location is not identity!
  – Fusion with monochannel analysis.
  – Fusion with video.

Page 50: Spatio-Temporal Analysis of Multimodal Speaker Activity

Thank you!

Page 51: Spatio-Temporal Analysis of Multimodal Speaker Activity

Detection: Energy and Localization

Page 52: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Delay-sum vs Proposed (1/3)

With optimized centroids (this work)

With delay-sum centroids (this work)

Page 53: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Delay-sum vs Proposed (2/3)

2 loudspeakers simultaneously active

Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.9%        100%
Average nb detected    2.0      1.8          1.9

3 loudspeakers simultaneously active

Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.2%        99.8%
Average nb detected    3.0      1.9          2.5

Page 54: Spatio-Temporal Analysis of Multimodal Speaker Activity

Task 2: Delay-sum vs Proposed (3/3)

2 humans simultaneously active

Metric                 Ideal     Delay-sum    Proposed
>=1 detected           ~89.4%    80.0%        90.8%
Average nb detected    ~1.3      1.0          1.3

3 humans simultaneously active

Metric                 Ideal     Delay-sum    Proposed
>=1 detected           ~96.5%    86.7%        95.1%
Average nb detected    ~2.0      1.4          1.6