speaker authentication - interspeech 2011_v3
H. Aronowitz (IBM Research) Interspeech 2011 1/27
Hagai Aronowitz, Ron Hoory (IBM Research – Haifa)
Jason Pelecanos, David Nahamoo (IBM T.J. Watson Research Center)
New Developments in Voice Biometrics for User Authentication
Speaker Verification for Mobile Banking Transactions
Mobile banking services
- “I want to transfer 10K Dollars from my account to account #53463985”
Current solution is based on RSA SecurID
Proposed solution: multi-factor authentication
- Speaker verification
- Face recognition
- …
The User Authentication Evaluation
The evaluation focuses on speaker verification
Wells-Fargo (WF) bank collected data from 750 employees
IBM Research participated in the evaluation
Evaluation rules are similar to NIST-SRE rules (however, gender is assumed to be unknown)
Outline
1. Evaluation description
2. Technology
   - Text-independent
   - Text-dependent
3. Improvements
4. Results
5. Post-evaluation work and conclusions
Evaluation Description
Authentication Conditions
1. A global digit string such as 0123456789
   - Attackers may use a recording
   - Easiest to classify
   - Denoted by the global condition
2. A speaker dependent password such as “4131024773”
   - May be eavesdropped / recorded
   - We assume the worst case scenario: impostor knows the password
   - Denoted by the speaker condition
3. A prompted random digit-string
   - Hardest to accurately authenticate
   - Denoted by the prompted condition
4. Free speech
   - More natural, especially for call-center scenario
   - Denoted by the TI condition
WF POT Data
- 750 speakers (200 for Dev, 550 for Eval)
- Data recorded over 4 weeks
- 4 sessions recorded per speaker: 2 landline + 2 cellular
- Each session consists of all authentication conditions
- Some digit-strings are repeated 3 times in order to allow enrollment/verification with more than a single repetition
Dev data

Condition           Dev data
Global              Same digit-strings as evaluated
Speaker, Prompted   Different digit-strings than evaluated
TI                  Different text than evaluated
Technology
Speaker Verification Systems
Text-independent systems
- GMM-based Joint Factor Analysis (JFA)
- GMM-based Nuisance Attribute Projection (NAP)
- We use both systems for all authentication conditions
Text-dependent system
- HMM-based NAP
- We use this system for the global condition only
GMM-Based JFA
A standard JFA system: M = m + Vy + Dz + Ux
Hyperparameters (m, V, D, U) estimated from standard telephony data
- Switchboard-II, NIST 2004 & 2006
- 12,711 sessions in total
Front end: VAD + 12 MFCC + 12 Δ + 12 ΔΔ + feature warping
Linear scoring
Symmetric scoring: forward + reverse scoring
ZT-score normalization using WF-POT Dev data
- 800 sessions (200 speakers × 4 sessions)
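A minimal sketch of the JFA supervector model M = m + Vy + Dz + Ux with hyperparameters (m, V, D, U). The toy dimensions, the values, and the helper names `matvec` and `jfa_supervector` are illustrative assumptions; D is assumed diagonal (as is usual in JFA) and stored as a vector.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def jfa_supervector(m, V, y, D_diag, z, U, x):
    """Return M = m + V y + D z + U x, with D diagonal (given as a vector)."""
    Vy = matvec(V, y)                           # speaker (eigenvoice) term
    Dz = [d * zi for d, zi in zip(D_diag, z)]   # speaker residual term
    Ux = matvec(U, x)                           # channel / session term
    return [mi + a + b + c for mi, a, b, c in zip(m, Vy, Dz, Ux)]


# Toy example (assumed sizes): supervector dimension 4,
# 2 speaker factors, 1 channel factor.
m = [0.0, 1.0, 2.0, 3.0]                                  # UBM mean supervector
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]      # eigenvoice matrix
U = [[0.5], [0.5], [0.0], [0.0]]                          # eigenchannel matrix
D_diag = [0.5, 0.5, 0.5, 0.5]                             # diagonal of D
y = [1.0, 2.0]                                            # speaker factors
z = [0.0, 0.0, 0.0, 2.0]                                  # residual factors
x = [2.0]                                                 # channel factors

M = jfa_supervector(m, V, y, D_diag, z, U, x)
print(M)
```

In a real system the factors y, z and x are estimated from the session statistics; here they are simply given.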
GMM-Based NAP Baseline
UBM & NAP are trained from NIST 2004
Supervectors created using normalized GMM-means
Front end: 13 MFCC + 13 Δ + VAD + feature warping
Dot product scoring
ZT-score normalization using same data as the JFA system
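NAP projection followed by dot product scoring can be sketched as below. This assumes the nuisance basis has orthonormal columns and that the supervectors are already normalized; the function names and toy values are illustrative and not taken from the evaluated system.

```python
def nap_project(s, nuisance_basis):
    """Remove the nuisance subspace: s - U (U^T s).

    nuisance_basis is a list of orthonormal direction vectors (columns of U).
    """
    out = list(s)
    for u in nuisance_basis:
        coeff = sum(ui * si for ui, si in zip(u, s))
        out = [oi - coeff * ui for oi, ui in zip(out, u)]
    return out


def dot_score(enroll, test, nuisance_basis):
    """Dot product between the two NAP-compensated supervectors."""
    a = nap_project(enroll, nuisance_basis)
    b = nap_project(test, nuisance_basis)
    return sum(ai * bi for ai, bi in zip(a, b))


# Toy example: the first coordinate carries only channel variability.
basis = [[1.0, 0.0, 0.0]]
enroll = [5.0, 1.0, 2.0]    # e.g. a landline session
test = [-3.0, 1.0, 2.0]     # same speaker, e.g. a cellular session

raw = sum(a * b for a, b in zip(enroll, test))
compensated = dot_score(enroll, test, basis)
print(raw, compensated)
```

Without the projection the channel offset dominates the raw dot product; after it, only the remaining coordinates contribute to the score.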
HMM-Based NAP
Speaker independent (SI) HMM training
- Using text-matched Dev data (200 speakers × 4 sessions)
MAP adaptation estimation of session dependent HMMs
- 3 repetitions used for enrollment
- 1 or 2 repetitions used for verification
Supervectors created using normalized GMM-means of the HMMs
Front-end, NAP, scoring, score normalization
- Same as for the GMM-NAP system
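The ZT-score normalization used by these systems can be sketched in one common formulation: Z-norm first, then T-norm on the Z-normalized scores. The argument names and toy scores are assumptions for illustration; in the evaluation the normalization cohorts came from the WF-POT Dev data.

```python
from statistics import mean, pstdev


def z_norm(score, scores_vs_impostor_utts):
    """Normalize a score by a model's score statistics on impostor utterances."""
    mu = mean(scores_vs_impostor_utts)
    sigma = pstdev(scores_vs_impostor_utts)
    return (score - mu) / sigma


def zt_norm(raw, model_vs_zutts, cohort_vs_test, cohort_vs_zutts):
    """Z-norm followed by T-norm.

    raw             score of the target model against the test utterance
    model_vs_zutts  target model scored against the Z-norm utterances
    cohort_vs_test  each T-norm cohort model scored against the test utterance
    cohort_vs_zutts for each cohort model, its scores on the Z-norm utterances
    """
    z_raw = z_norm(raw, model_vs_zutts)
    z_cohort = [z_norm(s, zs) for s, zs in zip(cohort_vs_test, cohort_vs_zutts)]
    return (z_raw - mean(z_cohort)) / pstdev(z_cohort)


# Toy scores (assumed values).
normalized = zt_norm(
    raw=5.0,
    model_vs_zutts=[0.0, 2.0],
    cohort_vs_test=[1.0, 3.0],
    cohort_vs_zutts=[[0.0, 2.0], [0.0, 2.0]],
)
print(normalized)
```

A production version would guard against a zero standard deviation, which this sketch does not.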
Linear Fusion
Global condition
- HMM-NAP score: 50%
- GMM-JFA score: 25%
- GMM-NAP score: 25%
Speaker, prompted & TI conditions
- GMM-JFA score: 50%
- GMM-NAP score: 50%
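The fusion weights above amount to a fixed weighted sum of the per-system scores. A minimal sketch, assuming the scores are already normalized to a comparable scale; the dictionary keys and the `fuse` helper are hypothetical names.

```python
# Weights as listed on the slide; key names are illustrative.
FUSION_WEIGHTS = {
    "global":   {"hmm_nap": 0.50, "gmm_jfa": 0.25, "gmm_nap": 0.25},
    "speaker":  {"gmm_jfa": 0.50, "gmm_nap": 0.50},
    "prompted": {"gmm_jfa": 0.50, "gmm_nap": 0.50},
    "ti":       {"gmm_jfa": 0.50, "gmm_nap": 0.50},
}


def fuse(condition, scores):
    """Linear fusion: weighted sum of the system scores for one trial."""
    weights = FUSION_WEIGHTS[condition]
    return sum(w * scores[name] for name, w in weights.items())


# Toy trial scores (assumed values).
global_score = fuse("global", {"hmm_nap": 2.0, "gmm_jfa": 4.0, "gmm_nap": 0.0})
ti_score = fuse("ti", {"gmm_jfa": 1.0, "gmm_nap": 3.0})
print(global_score, ti_score)
```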
Improvements
Improvements
Main method: tuning to the WF POT Dev data
- JFA: hard to tune because it needs large amounts of data
- HMM-NAP: already tuned to Dev data
- GMM-NAP: we can tune the UBM and the NAP projection
Methodology
- Research focused on the global condition
- Conclusions have been applied to other conditions
Extended dataset for the global condition
- 6 different 10-digit strings + 2 textual passwords (“At WF my voice is my password”, “There is no place like home”)
- Channel conditions: 75% mismatched trials, 25% matched trials
GMM-Based NAP Improvements
Improved NAP
2-wire NAP
- In [1] we have shown that removal of the speaker-subspace improves accuracy compared to no subspace removal
- In [2] we have shown that removal of dominant components of the speaker-subspace on top of the channel-subspace outperforms standard NAP
- Theoretical motivation was given for speaker-ID in 2-wire data [2], but improvements were also observed for 4-wire speaker-ID
- On the WF data we observe 6% relative error reduction
[1] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.
[2] Y. A. Solewicz, H. Aronowitz, “Two-Wire Nuisance Attribute Projection”, in Proc. Interspeech, 2009.
GMM-Based NAP Improvements
NAP training

NAP training data                          Error reduction
NIST-2004                                  Baseline
IBM 2003 digits dataset                    21%
The whole WF Dev-set                       26%
Text matched utterances from WF Dev-set    29%

Setup
- UBM is trained from NIST04 data
- Same trend is observed when UBM is trained from IBM 2003 digits dataset / WF-POT data
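How the choice of NAP training data enters can be illustrated with a sketch of standard NAP estimation: the nuisance direction is the dominant direction of within-speaker, across-session variability in the training supervectors. This is an assumed textbook recipe (power iteration, one direction only), not the exact procedure used in the evaluation, and it does not cover 2-wire NAP.

```python
import math


def within_speaker_deviations(sessions_by_speaker):
    """Subtract each speaker's mean supervector from that speaker's sessions."""
    devs = []
    for sessions in sessions_by_speaker.values():
        dim = len(sessions[0])
        mu = [sum(s[i] for s in sessions) / len(sessions) for i in range(dim)]
        devs.extend([s[i] - mu[i] for i in range(dim)] for s in sessions)
    return devs


def dominant_nuisance_direction(devs, iters=200):
    """Top eigenvector of the within-speaker scatter, by power iteration."""
    dim = len(devs[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply the scatter matrix implicitly: S v = sum_d d (d . v).
        w = [0.0] * dim
        for d in devs:
            c = sum(di * vi for di, vi in zip(d, v))
            w = [wi + c * di for wi, di in zip(w, d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v


# Toy training set (assumed): sessions of a speaker differ mainly in coordinate 0.
training = {
    "spk_a": [[1.0, 5.0, 0.0], [-1.0, 5.0, 0.2]],
    "spk_b": [[1.0, -5.0, 0.0], [-1.0, -5.0, -0.2]],
}
direction = dominant_nuisance_direction(within_speaker_deviations(training))
print([round(c, 6) for c in direction])
```

Training on text-matched data changes only the `training` input: the estimated directions then reflect the session variability of the digit strings actually being verified.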
GMM-Based NAP Improvements
UBM training data

UBM training data                          Error reduction
NIST-2004                                  Baseline
IBM 2003 digits dataset                    4%
Text matched utterances from WF Dev-set    15%

Setup
- NAP is trained from text-matched WF-POT data
- Same trend is observed when NAP is trained from NIST04 or IBM 2003 digits dataset
GMM-Based NAP Improvements
Summary
Methods
1. 2-wire NAP
2. Text matched data for UBM training
3. Text matched data for NAP training
Results
- 40% error reduction compared to using NIST dev data for UBM and NAP training
- 25% error reduction compared to using IBM 2003 digits dataset for UBM and NAP training
- These techniques have been successfully used for the speaker, prompted and TI conditions
Results
Results on NIST-2008
Condition: short2-short3, tel-tel, males
Results are in EER (%)

GMM JFA    GMM NAP
1.4        3.6
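For reference, the equal error rate (EER) is the operating point where the false-accept and false-reject rates coincide. A minimal sketch using a simple threshold sweep; the function name and the toy scores are assumptions, and evaluation toolkits typically interpolate along the DET curve instead.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: sweep thresholds and take the point where
    false-accept and false-reject rates are closest."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score is >= the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (assumed values): one target scores below two impostors.
targets = [0.9, 0.8, 0.7, 0.3]
impostors = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
print(f"EER = {100 * eer:.1f}%")
```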
Results for Single Verification Utterance
Matched channel

Condition   GMM JFA   GMM NAP   HMM NAP   Fused
Global      1.70      1.01      0.90      0.70
Speaker     2.21      1.82      -         1.26
Prompted    6.49      5.63      -         3.40
TI          1.24      1.35      -         0.65

Mismatched channel

Condition   GMM JFA   GMM NAP   HMM NAP   Fused
Global      5.07      2.99      2.35      1.95
Speaker     5.68      5.05      -         3.64
Prompted    12.33     11.85     -         8.33
TI          4.24      4.85      -         2.50
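The effect of channel mismatch can be read off the Fused columns above with a short calculation. The values are copied from the two single-utterance tables and are taken to be EERs in %, as the conclusions suggest; the variable names are illustrative.

```python
# Fused results for a single verification utterance, from the tables above.
fused_eer = {
    "Global":   {"matched": 0.70, "mismatched": 1.95},
    "Speaker":  {"matched": 1.26, "mismatched": 3.64},
    "Prompted": {"matched": 3.40, "mismatched": 8.33},
    "TI":       {"matched": 0.65, "mismatched": 2.50},
}

# Ratio of mismatched-channel to matched-channel error per condition.
ratios = {
    cond: eers["mismatched"] / eers["matched"]
    for cond, eers in fused_eer.items()
}

for cond, ratio in ratios.items():
    print(f"{cond}: mismatched / matched = {ratio:.2f}x")
```

The ratios range from roughly 2.5x (prompted) to roughly 3.8x (TI).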
Results for Two Verification Utterances
Matched channel

Condition   GMM JFA   GMM NAP   HMM NAP   Fused
Global      1.05      0.86      0.66      0.55
Speaker     1.50      1.37      -         0.85

Mismatched channel

Condition   GMM JFA   GMM NAP   HMM NAP   Fused
Global      3.34      1.99      1.66      1.41
Speaker     4.11      3.97      -         2.74
TI Accuracy as Function of Session Length
Enrollment
- Two sessions (1 landline + 1 cellular)
- Enrollment sessions length: ~25 sec each
Post-Evaluation Work & Conclusions
Post-Evaluation Work
Error reduction (~20%)
- i-vector based system
- Weighted symmetric scoring*
- Robust scoring*
  Handling estimation uncertainty by weighting the contribution of each Gaussian using a geometric mean of the Gaussian occupancy counts. Motivated by [Campbell, 2010].
Goat Detection
- Talk given earlier today by Orith Toledo-Ronen
Fast JFA scoring
- Using efficient approximated factors estimation*
* H. Aronowitz, O. Barkan, “New Developments in Joint Factor Analysis for Speaker Verification”, in Proc. Interspeech, 2011. Talk will be given today at 4:20 PM
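One possible reading of the occupancy-based weighting described above, given only as a hedged sketch: each Gaussian's contribution to a dot-product score is weighted by the geometric mean of its occupancy counts in the two sessions. The normalization of the weights, the function name and the toy values are assumptions; the cited paper defines the actual method.

```python
import math


def occupancy_weighted_score(enroll_blocks, test_blocks, n_enroll, n_test):
    """Dot-product score with per-Gaussian weights sqrt(n_enroll[g] * n_test[g]).

    enroll_blocks / test_blocks hold one sub-vector per Gaussian.
    The weights are normalized to sum to one (an assumption of this sketch).
    """
    weights = [math.sqrt(a * b) for a, b in zip(n_enroll, n_test)]
    total = sum(weights)
    score = 0.0
    for w, e, t in zip(weights, enroll_blocks, test_blocks):
        score += (w / total) * sum(ei * ti for ei, ti in zip(e, t))
    return score


# Toy example (assumed values): Gaussian 1 was never observed in enrollment,
# so its poorly estimated sub-vector gets zero weight.
enroll_blocks = [[1.0, 2.0], [100.0, 100.0]]
test_blocks = [[3.0, 1.0], [-100.0, 50.0]]
n_enroll = [4.0, 0.0]
n_test = [9.0, 5.0]

score = occupancy_weighted_score(enroll_blocks, test_blocks, n_enroll, n_test)
print(score)
```

The sketch assumes at least one Gaussian has nonzero occupancy in both sessions; otherwise the normalization divides by zero.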
Conclusions
1. We evaluated JFA, GMM-NAP, HMM-NAP and a fused system on 4 authentication conditions
2. HMM-NAP was the best standalone system for the global condition
3. GMM-NAP outperformed JFA on the TD conditions due to its full usage of the WF POT Dev data
   - Baseline GMM-NAP was improved by 40% using better Dev data for UBM and NAP-projection estimation and using 2-wire-NAP
4. EERs lower than 1% have been obtained for the matched channel condition
5. EER roughly triples for the mismatched channel condition
6. Multi-condition authentication leads to even smaller EERs