-
JHU-HLTCOE system for VoxSRC 2019
Daniel Garcia-Romero, Alan McCree, David Snyder, and Gregory Sell
Human Language Technology Center Of Excellence
Johns Hopkins University
-
VoxSRC characteristics
• Unlike recent NIST SRE setups:
  • No explicit domain shift
  • Pair-wise comparison of relatively short segments
    • VoxCeleb1-E validation set has a median segment duration of 7 seconds
• These characteristics:
  • Minimize the need for an external classifier:
    • Not necessary to perform domain adaptation
    • No need to aggregate information from multiple long recordings (e.g., 5 minutes)
  • Open the door for end-to-end approaches:
    • DNNs that output direct pair-wise scores (trained with binary cross-entropy)
    • However, training is cumbersome (e.g., high sensitivity to harvesting samples and unstable convergence)
-
Our strategy
• Train embedding extractors using a classification loss
  • Categorical/multi-class cross-entropy
• Two modules
  • Embedding extractor (x-vectors)
    • E-TDNN
    • F-TDNN
  • Classification head
    • Surrogate classification task to produce discriminative embeddings
    • We have explored 4 alternatives
      • Softmax, Angular-Margin Softmax, PLDA softmax, Meta-learning softmax
[Figure: embedding extractor feeding a classification head with outputs s1, s2, s3, s4, ..., sN]
-
X-vector extractor
[Figure: x-vector extractor in three stages: 1D conv layers (TDNN), temporal statistics layer (mean, variance), and bottleneck, followed by a classification head with outputs s1, s2, s3, s4, ..., sN]
-
1D-conv stack (TDNN)
[Figure: stacked layers along the time axis: input layer, first layer, second layer, third layer]
-
Temporal pooling
[Figure: TDNN layers along the time axis (input, first, second, third, ...) feed a temporal statistics layer (mean, variance), followed by Affine + LeakyReLU (bottleneck) and Affine + Softmax with outputs s1, s2, s3, s4, ..., sN]
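The temporal statistics layer can be illustrated with a minimal sketch in plain Python. It assumes the frame-level outputs are given as a list of equal-length vectors and that the pooled vector is the per-dimension mean concatenated with the per-dimension variance, as the slide labels suggest; the function name `stats_pool` is hypothetical and not taken from the authors' code.

```python
from typing import List, Sequence


def stats_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the per-dimension mean and variance over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population variance (divide by n); an assumption, the slides do not say.
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean + var


# Three frames of a 2-dimensional layer output -> one 4-dimensional vector.
pooled = stats_pool([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
print(pooled)
```

The output length depends only on the layer width, not on the number of frames, which is what lets a variable-length segment map to a fixed-size embedding.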
-
Bottleneck
[Figure: same network as on the "Temporal pooling" slide: TDNN layers, temporal statistics layer (mean, variance), Affine + LeakyReLU (bottleneck), and Affine + Softmax with outputs s1, s2, s3, s4, ..., sN]
-
Extended-TDNN (E-TDNN)
-
Factorized-TDNN (F-TDNN)
-
Classification heads
Traditional Softmax
[Figure: TDNN layers -> temporal pooling layer (mean, variance; ~3000 dim) -> bottleneck (256 dim) -> LeakyReLU (256 dim) -> Affine + Softmax]
• Affine layer (class weights)
• Inner products with x-vec
• Use external classifier (PLDA)
Angular Softmax with margin penalty
[Figure: TDNN layers -> temporal pooling layer (mean, variance; ~3000 dim) -> bottleneck (256 dim) -> length-norm (256 dim) -> Ang. Margin Softmax]
• Length-normalized linear layer
• Cosine score with x-vec as logits
• Cosine scoring or PLDA
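The angular softmax with margin penalty can be sketched as follows. This assumes the additive-margin form, in which the target-class cosine is reduced by m before all cosines are scaled by s; the slides only name the head and the values of m and s, so the exact formulation and the function names here are assumptions.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def am_softmax_loss(xvec: Sequence[float],
                    class_weights: Sequence[Sequence[float]],
                    target: int, m: float = 0.3, s: float = 30.0) -> float:
    """Cross-entropy over scaled cosine logits with a margin on the target."""
    x = l2_normalize(xvec)
    # Cosine between the unit-length x-vector and each unit-length class weight.
    cosines = [sum(a * b for a, b in zip(x, l2_normalize(w)))
               for w in class_weights]
    # Assumed additive margin: only the target logit is penalized by m.
    logits = [s * (c - m) if k == target else s * c
              for k, c in enumerate(cosines)]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_z - logits[target]


weights = [[1.0, 0.0], [0.0, 1.0]]
loss_no_margin = am_softmax_loss([1.0, 1.0], weights, target=0, m=0.0)
loss_margin = am_softmax_loss([1.0, 1.0], weights, target=0, m=0.3)
print(loss_no_margin, loss_margin)
```

With the margin active, an embedding that sits equally close to two classes is penalized much more heavily, which pushes the network to separate classes by a clear angular gap.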
-
Classification heads
PLDA Softmax
[Figure: TDNN layers -> temporal pooling layer (mean, variance; ~3000 dim) -> bottleneck (256 dim) -> LeakyReLU (256 dim) -> PLDA + Softmax]
• Class weights (moving average)
• PLDA scoring for logits
• Use internal classifier (PLDA)
Meta-learning Softmax
[Figure: TDNN layers -> temporal pooling layer (mean, variance; ~3000 dim) -> bottleneck (256 dim) -> length-norm (256 dim) -> Ang. Softmax (subset)]
• Each minibatch contains pairs of a subset of classes
• One of them is used as class weight
• Cosine scoring
-
Scoring
• External Gaussian PLDA classifier
  • Trained on x-vectors from concatenated segments of a video
  • Length-normalization applied to x-vectors
• Internal Gaussian PLDA classifier
  • Use the parameters trained jointly with the x-vector extractor
  • Length-normalization applied to x-vectors
• Cosine scoring
  • Inner product of unit-length x-vectors
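Cosine scoring as described above is simple enough to show directly. The following is a minimal sketch using plain Python lists as x-vectors; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def length_normalize(x: Sequence[float]) -> List[float]:
    """Project an x-vector onto the unit sphere."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in x]


def cosine_score(enroll: Sequence[float], test: Sequence[float]) -> float:
    """Inner product of the two unit-length x-vectors of a trial."""
    a = length_normalize(enroll)
    b = length_normalize(test)
    return sum(p * q for p, q in zip(a, b))


same_direction = cosine_score([1.0, 0.0], [3.0, 0.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 2.0])
print(same_direction, orthogonal)
```

Higher scores indicate that the two segments are more likely to come from the same speaker; no parameters need to be trained for this scorer.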
-
Data preparation
• Augmented VoxCeleb2-dev set for DNN training
  • 1 million original + 5 million augmented versions
  • Randomly pick augmentations from: reverb, noise, music, babble
    • Typical Kaldi setup
• VoxCeleb1-E (cleaned) as validation set
• 16 kHz processing
  • 80 Mel-filter bank energies (20-7600 Hz)
  • 25 ms window every 10 ms
• No speech activity detection (SAD) used
  • Initially we attempted to use a SAD DNN (trained on open data) to differentiate between OPEN and FIXED tracks, but it did not make any difference
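The framing parameters above translate into sample counts as in the following sketch. It assumes that only windows lying fully inside the signal are kept (no padding at the edges), which the slides do not specify; the function name is hypothetical.

```python
def num_frames(num_samples: int, sample_rate_hz: int = 16_000,
               window_ms: int = 25, shift_ms: int = 10) -> int:
    """Count analysis windows that fit entirely inside the signal."""
    window = sample_rate_hz * window_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate_hz * shift_ms // 1000    # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


# A 7-second segment, the median duration reported for VoxCeleb1-E.
frames_7s = num_frames(7 * 16_000)
print(frames_7s)
```

Each frame is then represented by 80 Mel-filter bank energies, so a segment becomes a matrix of shape (frames, 80) before it enters the TDNN layers.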
-
DNN training in PyTorch
• Data parallelism (4 Nvidia 2080 RTX GPUs)
  • 128 samples per GPU (mini-batch of 512)
  • Batch-norm was not synchronous (slightly faster)
• Batch construction
  • Augmented data is stored in Kaldi “arks” of 80 Mel-filter bank energies
  • Uniformly sample (without replacement) a subset of classes
  • An audio sample from each class is selected
    • 4 sec chunk extracted using a random offset
  • After all classes are sampled we repeat until end of training
• SGD with momentum optimizer
  • Learning rate of 0.4 fixed for 50K steps, then exponential decay every 10K steps
  • 150K steps to train the nets
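The batch construction and learning-rate schedule can be sketched as below. Several details are assumptions rather than facts from the slides: the decay factor `gamma`, the choice to apply the first decay at step 50K, the 400-frame chunk length (4 seconds at a 10 ms shift), and the data layout `utts_by_class`, which maps a class to hypothetical (utterance id, frame count) pairs.

```python
import random
from typing import Dict, Iterator, List, Tuple

CHUNK_FRAMES = 400  # assumed: 4 s at a 10 ms frame shift


def batches(utts_by_class: Dict[str, List[Tuple[str, int]]],
            batch_size: int,
            rng: random.Random) -> Iterator[List[Tuple[str, str, int]]]:
    """Yield batches of (class, utterance id, start frame) indefinitely."""
    pending: List[str] = []
    while True:
        batch = []
        while len(batch) < batch_size:
            if not pending:
                # All classes have been used once: reshuffle and start again.
                pending = list(utts_by_class)
                rng.shuffle(pending)
            spk = pending.pop()  # sampling classes without replacement
            utt_id, n_frames = rng.choice(utts_by_class[spk])
            start = rng.randint(0, max(0, n_frames - CHUNK_FRAMES))
            batch.append((spk, utt_id, start))
        yield batch


def learning_rate(step: int, base_lr: float = 0.4, hold_steps: int = 50_000,
                  decay_every: int = 10_000, gamma: float = 0.5) -> float:
    """Constant rate, then one multiplication by gamma per decay interval."""
    if step < hold_steps:
        return base_lr
    n_decays = (step - hold_steps) // decay_every + 1
    return base_lr * gamma ** n_decays


data = {f"spk{i}": [(f"utt{i}", 1000)] for i in range(4)}
gen = batches(data, batch_size=2, rng=random.Random(0))
first, second = next(gen), next(gen)
print(first, second, learning_rate(65_000))
```

In this sketch a batch can contain the same class twice when the class list is refilled in the middle of a batch; whether the original setup avoids that is not stated.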
-
Results (VoxCeleb1-E): Size vs performance
• E-TDNN with AM-Softmax (m=0.3, s=30)
[Figure: EER (%) vs. layer size / parameters in millions]
Layer size   Parameters (millions)   EER (%)
128          0.5                     3.28
256          1.5                     2.26
512          5                       1.77
1024         15.5                    1.61
2048         60                      1.51
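For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping a threshold over the sorted scores; it is an illustration only, not the official VoxSRC scoring tool, and it does not treat tied scores specially.

```python
from typing import Sequence


def compute_eer(target_scores: Sequence[float],
                nontarget_scores: Sequence[float]) -> float:
    """Return the error rate where false accepts and false rejects balance."""
    labeled = sorted([(s, 1) for s in target_scores] +
                     [(s, 0) for s in nontarget_scores])
    n_tar, n_non = len(target_scores), len(nontarget_scores)
    # Threshold below every score: all trials accepted (FRR = 0, FAR = 1).
    misses, false_accepts = 0, n_non
    best_gap, eer = 1.0, 0.5
    for _, is_target in labeled:
        # Raise the threshold just above this score, so the trial is rejected.
        if is_target:
            misses += 1
        else:
            false_accepts -= 1
        frr = misses / n_tar
        far = false_accepts / n_non
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


separable = compute_eer([0.9, 0.8], [0.1, 0.2])
overlapping = compute_eer([0.9, 0.3], [0.1, 0.5])
print(separable, overlapping)
```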
-
Results (VoxCeleb1-E): Cosine vs PLDA classifier
• E-TDNN with AM-Softmax (m=0.3, s=30)
[Figure: EER (%) vs. layer size / parameters in millions, for cosine and G-PLDA scoring]
Layer size   Parameters (millions)   Cosine EER (%)   G-PLDA EER (%)
128          0.5                     3.28             3.08
256          1.5                     2.26             2.13
512          5                       1.77             1.73
1024         15.5                    1.61             1.61
2048         60                      1.51             1.53
-
Results: System comparison
System   Classification head        Layer / Emb. dimension   Params (millions)   Scoring   VoxCeleb1-E EER (%)   VoxSRC eval EER (%)
E-TDNN   AM-Softmax (m=0.3, s=30)   1024 / 256               15.5                Cosine    1.61                  1.91
E-TDNN   AM-Softmax (m=0.1, s=50)   1024 / 512               16                  G-PLDA    1.74                  2.15
F-TDNN   Softmax                    1024 / 512               12.7                G-PLDA    1.93                  2.41
E-TDNN   PLDA-Softmax               1024 / 256               15.5                G-PLDA    1.93                  2.27
E-TDNN   Meta-learning Softmax      512 / 256                5                   Cosine    2.43                  ----
-
Results: Fusion of 4 heterogeneous systems
System       Classification head        Layer / Emb. dimension   Params (millions)   Scoring   VoxCeleb1-E EER (%)   VoxSRC eval EER (%)
E-TDNN       AM-Softmax (m=0.3, s=30)   1024 / 256               15.5                Cosine    1.61                  1.91
E-TDNN       AM-Softmax (m=0.1, s=50)   1024 / 512               16                  G-PLDA    1.74                  2.15
F-TDNN       Softmax                    1024 / 512               12.7                G-PLDA    1.93                  2.41
E-TDNN       PLDA-Softmax               1024 / 256               15.5                G-PLDA    1.93                  2.27
Sum fusion   ----                       ----                     57                  ----      1.22                  1.54
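Sum fusion can be read as adding, for each trial, the scores produced by the individual systems. The sketch below shows only that summation; whether the scores were scaled or calibrated before being summed is not stated in the slides, so no normalization is applied here and the function name is hypothetical.

```python
from typing import List, Sequence


def sum_fusion(system_scores: Sequence[Sequence[float]]) -> List[float]:
    """Add per-trial scores across systems (one inner sequence per system)."""
    if not system_scores:
        raise ValueError("need at least one system")
    n_trials = len(system_scores[0])
    if any(len(scores) != n_trials for scores in system_scores):
        raise ValueError("all systems must score the same trial list")
    return [sum(trial) for trial in zip(*system_scores)]


# Two hypothetical systems scoring the same two trials.
fused = sum_fusion([[1.0, -2.0], [0.5, 0.5]])
print(fused)
```

Cosine and PLDA scores live on different scales, so in practice the relative weight of each system in a plain sum depends on its score range.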
-
Results: Fusion vs single large DNN
System       Classification head        Layer / Emb. dimension   Params (millions)   Scoring   VoxCeleb1-E EER (%)   VoxSRC eval EER (%)
E-TDNN       AM-Softmax (m=0.3, s=30)   2048 / 256               60                  Cosine    1.51                  1.72
Sum fusion   ----                       ----                     57                  ----      1.22                  1.54
• Fusing smaller heterogeneous systems outperforms a large DNN with a similar number of parameters
-
Conclusions
• VoxSRC poses different challenges from those of recent NIST SREs
  • No explicit domain shift
  • Pair-wise comparison of relatively short segments
• We explored 2 x-vector topologies and 4 classification heads
  • Angular Softmax with margin penalty produces strong systems that can use cosine distance for pair-wise comparisons
  • PLDA-Softmax was able to control its capacity and generalized well at test time
  • Our attempt using meta-learning was not very encouraging
• Fusion of 4 systems outperforms a single large network with a similar number of parameters