
Macquarie RT05s Speaker Diarisation System

Steve Cassidy

Centre for Language Technology

Macquarie University

Sydney

© 19 Apr 2023 Macquarie University

System Goals

• Develop a simple end-to-end system for the SPKR task

• Platform for experimentation

• Improve on RT04s system


Overall Results

[Figure: bar chart. Axis ticks 0 to 90; categories AMI, CMU, ICSI, NIST, VT (each label appears twice).]


System Overview

Feature Extraction → SAD → Segmentation → Turn Clustering → Speaker ID

• Single Distant Microphone

• Implemented in C and Tcl

• Runs in around 6x real time on a single AMD64

• Developed with RT04 devtest data
– No AMI or VT data seen before eval


Feature Extraction

• 26 coefficients:
– 12 MFCC
– RMS Energy
– Delta Coefficients

• 10ms frame rate, 25.6ms window

• Mean subtraction based on mean of first 60 seconds of file

• Uses the KTH Snack toolkit
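The mean subtraction step can be sketched as below. This is only an illustration in plain Python, not the system's C/Tcl code: the function name `subtract_initial_mean` and its parameters are hypothetical, and the 100 frames per second default simply follows from the 10ms frame rate quoted above.

```python
def subtract_initial_mean(frames, frame_rate_hz=100, seconds=60.0):
    """Subtract the mean of the first `seconds` of a file from every frame.

    `frames` is a list of equal-length feature vectors (lists of floats).
    The name and signature are illustrative assumptions, not the system's API.
    """
    n = min(len(frames), int(frame_rate_hz * seconds))
    if n == 0:
        return []
    dim = len(frames[0])
    # Mean estimated from the initial stretch only, then applied to the whole file.
    mean = [sum(f[d] for f in frames[:n]) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

Estimating the mean from a fixed initial stretch means normalisation can start before the whole file has been read, at the cost of a less representative mean for long recordings.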


Speech Activity Detection

• Goal: find obvious regions of non-speech for gross segmentation of recording

• GMMs for speech and non-speech
– Speech model: 32 mixtures
– Non-speech model: 8 mixtures

• Trained on RT04s devtest data set
– Reference labels generated from speaker labelling
– Ignored silence regions < 0.3s




• Evaluate frame classification error (%):

Dataset         NSPER   SPER
RT04s unseen       32     19
RT05s              47     15




• SAD is performed by classifying successive windows of 10 frames using the GMM models

• Consecutive regions are merged and labelled

• Non-speech < 0.35s merged with following segment

• Speech < 0.15s merged with following non-speech
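The merging rules above can be sketched as follows. This is a hedged illustration, not the system's code: it assumes each 10-frame window (0.1s at the 10ms frame rate) has already been labelled by the GMMs, and the label strings `"speech"` / `"nonspeech"` and both function names are hypothetical.

```python
def merge_windows(labels, window_s=0.1):
    """Collapse runs of identical window labels into [start, end, label] segments."""
    segments = []
    for i, lab in enumerate(labels):
        start = i * window_s
        if segments and segments[-1][2] == lab:
            segments[-1][1] = start + window_s   # extend the current run
        else:
            segments.append([start, start + window_s, lab])
    return segments


def absorb_short(segments, min_nonspeech=0.35, min_speech=0.15):
    """Merge too-short segments into the segment that follows them."""
    out = []
    pending_start = None
    for idx, (start, end, lab) in enumerate(segments):
        if pending_start is not None:
            start = pending_start                # take over the absorbed segment's span
            pending_start = None
        limit = min_nonspeech if lab == "nonspeech" else min_speech
        has_next = idx + 1 < len(segments)
        if end - start < limit and has_next:
            pending_start = start                # hand this span to the next segment
            continue
        if out and out[-1][2] == lab:
            out[-1][1] = end                     # same label as previous: join them
        else:
            out.append([start, end, lab])
    return out
```

A short segment at the very end of the recording has nothing following it, so this sketch leaves it in place; how the real system treats that case is not stated in the slides.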




• Evaluation
– Frame classification error
– Boundaries missed (nothing within 0.5s)
– Boundaries inserted inside real segments

Meeting      Frame Error %    Boundary Error %   # Auto
             NSPER   SPER     Miss     FP
CMU 1415        89      7       91     77            45
ICSI 1100       99      4       85     88            99
NIST 0939       71      9       83     84            97
AMI 1206        43     18       25     79           348
VT 1430        100      0       99     50             2


Turn Segmentation

• Speech regions are segmented using the Bayesian Information Criterion (BIC)

• Compare the fit of a single Gaussian model of the sequence with that of a pair of models, one each side of a candidate break

• Fixed windows of 200 frames advanced over speech region

• Peaks in delta BIC curve indicate change points
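The delta BIC comparison can be illustrated with the sketch below. Several things here are assumptions rather than details from the slides: the Gaussians use diagonal covariance (to keep the example free of matrix code), the break is tested at the window midpoint, and the window step, the `penalty_weight` parameter and all function names are hypothetical.

```python
import math


def log_det_diag(frames):
    """Log-determinant of a diagonal covariance fitted to `frames`."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        mean = sum(f[d] for f in frames) / n
        var = sum((f[d] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-6))        # floor avoids log(0)
    return total


def delta_bic(frames, split, penalty_weight=1.0):
    """Positive values favour two models (a change point) over one."""
    n, dim = len(frames), len(frames[0])
    left, right = frames[:split], frames[split:]
    gain = 0.5 * (n * log_det_diag(frames)
                  - len(left) * log_det_diag(left)
                  - len(right) * log_det_diag(right))
    # The second diagonal Gaussian adds 2 * dim parameters (means and variances).
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty


def bic_curve(frames, window=200, step=10, penalty_weight=1.0):
    """Slide a fixed window over the region; return (centre_frame, delta_bic) pairs."""
    curve = []
    for start in range(0, len(frames) - window + 1, step):
        chunk = frames[start:start + window]
        curve.append((start + window // 2,
                      delta_bic(chunk, window // 2, penalty_weight)))
    return curve
```

Peaks of the returned curve that rise above zero would be the candidate change points; a full-covariance model would replace the diagonal log-determinant and change the parameter count in the penalty.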


Turn Segmentation

[Figure: bar chart of Miss and FP boundary errors (% Error, axis 0 to 100) for CMU/98, ICSI/198, NIST/257, AMI/427 and VT/168.]


Turn Clustering

• Given a set of speaker turns, find natural clusters

• Number of clusters unknown

• Requires:
– Distance metric on speaker turns
– Clustering algorithm
– Cluster evaluation metric


Speaker Similarity

Mean + variance of feature vectors

K-L distance metric
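As a rough sketch of this kind of distance, each turn can be summarised by the per-dimension mean and variance of its frames and compared with a Kullback-Leibler divergence between the two resulting diagonal Gaussians. The symmetrised form (sum of both directions) is an assumption here, since the slides do not say which variant was used, and the function names are hypothetical.

```python
import math


def turn_stats(frames):
    """Per-dimension mean and variance of a turn's feature vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def kl_diag(p, q):
    """KL(p || q) for diagonal Gaussians given as (mean, var) pairs."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(math.log(b / a) + (a + (m1 - m2) ** 2) / b - 1.0
                     for m1, a, m2, b in zip(mp, vp, mq, vq))


def symmetric_kl(p, q):
    """Symmetrised divergence, so that d(p, q) == d(q, p)."""
    return kl_diag(p, q) + kl_diag(q, p)
```

For example, two unit-variance turns whose means differ by 1 in a single dimension are at distance 1.0 under this definition.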


Turn Clustering

• Implementation:
– Select segments longer than 1.5s for clustering
– KL distance on mean/variance of features
– Hierarchical clustering
– Select labellings for 2, 3…N speakers
– Cluster evaluation performed after speaker ID
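The clustering step can be sketched as a bottom-up merge over a precomputed distance matrix that records the labelling at each speaker count. The average-linkage rule and the names `agglomerate`, `dist` and `max_speakers` are assumptions for illustration; the slides say only that hierarchical clustering was used.

```python
def agglomerate(dist, max_speakers):
    """Return {k: labels} for k = 2..max_speakers.

    `dist` is a symmetric matrix (list of lists) of turn-to-turn distances.
    Clusters are merged pairwise using average linkage (an assumed choice).
    """
    n = len(dist)
    clusters = [[i] for i in range(n)]
    labellings = {}
    while len(clusters) >= 2:
        k = len(clusters)
        if k <= max_speakers:
            labels = [0] * n
            for c, members in enumerate(clusters):
                for m in members:
                    labels[m] = c
            labellings[k] = labels
        if k == 2:
            break
        # Find the pair of clusters with the smallest average distance.
        best = None
        for a in range(k):
            for b in range(a + 1, k):
                d = (sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                     / (len(clusters[a]) * len(clusters[b])))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return labellings
```

Keeping every labelling from 2 to N, rather than committing to one cut of the tree, matches the point above that cluster evaluation is deferred until after speaker ID.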


Speaker ID

• Use cluster labelled turns to train speaker models
– 32 mixture GMM

• Now classify and re-label all speaker turns

• Potentially correct poor clustering decisions

• Very small amounts of data to support models
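The train-then-relabel loop can be sketched as below. To keep the example short and dependency-free, a single diagonal Gaussian per speaker stands in for the 32 mixture GMM described above; that substitution, and all function names, are assumptions made for illustration only.

```python
import math


def fit_diag_gaussian(frames):
    """Fit one diagonal Gaussian (a stand-in for the 32 mixture GMM)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-3)
           for d in range(dim)]
    return mean, var


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of a turn under a speaker model."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def relabel_turns(turns, labels):
    """Train one model per cluster label, then re-classify every turn."""
    pooled = {}
    for frames, lab in zip(turns, labels):
        pooled.setdefault(lab, []).extend(frames)
    models = {lab: fit_diag_gaussian(fr) for lab, fr in pooled.items()}
    return [max(models, key=lambda lab: avg_log_likelihood(frames, models[lab]))
            for frames in turns]
```

Because each model is trained on pooled data from its whole cluster, a turn that was clustered with the wrong speaker can still score higher under the correct speaker's model and be moved, which is the correction effect noted above.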


Overall Results

[Figure: bar chart. Axis ticks 0 to 90; categories AMI, CMU, ICSI, NIST, VT (each label appears twice).]


What Didn’t Work

• Inter-channel phase and level differences

• Exemplar speaker models

• SVD based turn clustering

– Find similar groups by factoring the distance matrix

– One product of SVD is a number of clusters
