H. Aronowitz (IBM Research) Interspeech 2011 1/27
Hagai Aronowitz, Ron Hoory (IBM Research – Haifa)
Jason Pelecanos, David Nahamoo (IBM T.J. Watson Research Center)
New Developments in Voice Biometrics for User Authentication
Speaker Verification for Mobile Banking Transactions
Mobile banking services
- “I want to transfer 10K Dollars from my account to account #53463985”
Current solution is based on RSA SecurID
Proposed solution: multi-factor authentication
– Speaker verification
– Face recognition
– …
The User Authentication Evaluation
The evaluation focuses on speaker verification
Wells-Fargo (WF) bank collected data from 750 employees
IBM Research participated in the evaluation
Evaluation rules are similar to NIST-SRE rules (however, gender is assumed to be unknown)
Outline
1. Evaluation description
2. Technology: text-independent and text-dependent systems
3. Improvements
4. Results
5. Post-evaluation work and conclusions
Evaluation Description
Authentication Conditions
1. A global digit string such as 0123456789
- Attackers may use a recording
- Easiest to classify
- Denoted by the global condition
2. A speaker dependent password such as “4131024773”
- May be eavesdropped / recorded
- We assume the worst case scenario: impostor knows the password
- Denoted by the speaker condition
3. A prompted random digit-string
- Hardest to accurately authenticate
- Denoted by the prompted condition
4. Free speech
- More natural, especially for a call-center scenario
- Denoted by the TI condition
WF POT Data
750 speakers (200 for Dev, 550 for Eval)
Data recorded over 4 weeks
4 sessions recorded per speaker
- 2 landline + 2 cellular
Each session consists of all authentication conditions
Some digit-strings are repeated 3 times in order to allow enrollment/verification with more than a single repetition

Dev data

| Condition | Dev data |
| --- | --- |
| Global | Same digit-strings as evaluated |
| Speaker, Prompted | Different digit-strings than evaluated |
| TI | Different text than evaluated |
Technology
Speaker Verification Systems
Text-independent systems
- GMM-based Joint Factor Analysis (JFA)
- GMM-based Nuisance Attribute Projection (NAP)
- We use both systems for all authentication conditions
Text-dependent system
- HMM-based NAP
- We use this system for the global condition only
GMM-Based JFA
A standard JFA system:
Hyperparameters (m, V, D, U) estimated from standard telephony data
- Switchboard-II, NIST 2004 & 2006
- 12,711 sessions in total
Front end: VAD + 12 MFCC + 12 Δ + 12 ΔΔ + feature warping
Linear scoring
Symmetric scoring: forward + reverse scoring
ZT-score normalization using WF-POT Dev data
- 800 sessions (200 speakers × 4 sessions)
Model: $M = m + Vy + Dz + Ux$
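To make the role of the hyperparameters (m, V, D, U) concrete, here is a minimal sketch of the standard JFA supervector decomposition M = m + Vy + Dz + Ux. The function names, the toy dimensions and the toy values are illustrative assumptions, not the system described on the slide; in the usual reading, V spans the speaker subspace, U the channel subspace and D is a diagonal residual term.

```python
def matvec(A, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def jfa_supervector(m, V, y, D, z, U, x):
    """Compose M = m + V y + D z + U x.

    D is assumed diagonal and is passed as the list of its diagonal entries.
    """
    Vy = matvec(V, y)                      # speaker-subspace offset
    Dz = [d * zi for d, zi in zip(D, z)]   # diagonal residual offset
    Ux = matvec(U, x)                      # channel (session) offset
    return [mi + a + b + c for mi, a, b, c in zip(m, Vy, Dz, Ux)]


# Toy example (hypothetical numbers): 3-dimensional supervector,
# one speaker factor and one channel factor.
m = [1.0, 2.0, 3.0]
V = [[1.0], [0.0], [1.0]]
U = [[0.0], [1.0], [0.0]]
D = [0.5, 0.5, 0.5]
M = jfa_supervector(m, V, [2.0], D, [0.0, 2.0, 0.0], U, [-1.0])
```

In a real system the supervector has tens of thousands of dimensions and the factors y, z, x are estimated per session; the sketch only shows how the terms add up.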
GMM-Based NAP Baseline
UBM & NAP are trained from NIST 2004
Supervectors created using normalized GMM-means
Front end: 13 MFCC + 13 Δ + VAD + feature warping
Dot product scoring
ZT-score normalization using same data as the JFA system
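The NAP projection followed by dot product scoring can be sketched as below. This is a toy illustration under stated assumptions: the nuisance basis is taken to be orthonormal and already estimated, and the names `nap_project` and `nap_score` are hypothetical rather than taken from the system on the slide.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nap_project(s, basis):
    """Remove from supervector s its components along each nuisance direction.

    `basis` is assumed to hold orthonormal vectors spanning the nuisance
    (channel) subspace, so this computes s - U (U^T s).
    """
    out = list(s)
    for u in basis:
        c = dot(u, s)
        out = [o - c * ui for o, ui in zip(out, u)]
    return out


def nap_score(enroll, test, basis):
    """Dot product score between the NAP-compensated supervectors."""
    return dot(nap_project(enroll, basis), nap_project(test, basis))


# Toy example (hypothetical numbers): the third coordinate is the nuisance direction.
basis = [[0.0, 0.0, 1.0]]
enroll = [1.0, 2.0, 5.0]
test = [1.0, 2.0, -7.0]
score = nap_score(enroll, test, basis)
```

The two toy supervectors differ only along the nuisance direction, so after projection they coincide and the score depends only on the remaining coordinates.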
HMM-Based NAP
Speaker independent (SI) HMM training
- Using text-matched Dev data (200 speakers × 4 sessions)
MAP adaptation estimation of session dependent HMMs
- 3 repetitions used for enrollment
- 1 or 2 repetitions used for verification
Supervectors created using normalized GMM-means of the HMMs
Front end, NAP, scoring and score normalization
- Same as for the GMM-NAP system
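All three systems use ZT-score normalization. A minimal sketch of one common formulation (Z-norm followed by T-norm) is given below; the cohort handling, the function names and the toy scoring function are assumptions for illustration, and the exact recipe used in the evaluation is not stated on the slides.

```python
from statistics import mean, pstdev


def znorm(score, impostor_scores):
    """Normalize a score by the mean and std of a model's impostor scores."""
    return (score - mean(impostor_scores)) / pstdev(impostor_scores)


def zt_norm(score_fn, model, test, z_utts, t_models):
    """Z-norm the trial score, then T-norm it against Z-normed cohort scores.

    score_fn(model, utterance) is any raw scoring function (assumed given).
    z_utts   : impostor utterances used for Z-norm statistics.
    t_models : cohort models used for T-norm statistics.
    """
    z_trial = znorm(score_fn(model, test),
                    [score_fn(model, u) for u in z_utts])
    z_cohort = [znorm(score_fn(t, test),
                      [score_fn(t, u) for u in z_utts])
                for t in t_models]
    return (z_trial - mean(z_cohort)) / pstdev(z_cohort)


# Toy example (hypothetical): "models" and "utterances" are scalars and the
# raw score is the negative distance between them.
def toy_score(model, utt):
    return -abs(model - utt)


normalized = zt_norm(toy_score, model=2, test=3, z_utts=[0, 6], t_models=[5, 7])
```

On the slides the normalization statistics come from the WF-POT Dev data (800 sessions); the toy cohort here only demonstrates the order of operations.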
Linear Fusion
Global condition
- HMM-NAP score – 50%
- GMM-JFA score – 25%
- GMM-NAP score – 25%
Speaker, prompted & TI conditions
- GMM-JFA score – 50%
- GMM-NAP score – 50%
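The fusion rule is a fixed weighted sum of the per-system scores. The sketch below encodes the weights listed on the slide; the dictionary layout, the key names and the function name `fuse` are illustrative assumptions, and the input scores are assumed to be already normalized to comparable ranges.

```python
# Fusion weights per authentication condition, as listed on the slide.
FUSION_WEIGHTS = {
    "global":   {"hmm_nap": 0.50, "gmm_jfa": 0.25, "gmm_nap": 0.25},
    "speaker":  {"gmm_jfa": 0.50, "gmm_nap": 0.50},
    "prompted": {"gmm_jfa": 0.50, "gmm_nap": 0.50},
    "ti":       {"gmm_jfa": 0.50, "gmm_nap": 0.50},
}


def fuse(condition, scores):
    """Return the weighted sum of the system scores for one trial."""
    weights = FUSION_WEIGHTS[condition]
    return sum(w * scores[name] for name, w in weights.items())


# Toy example with hypothetical (already normalized) scores.
global_fused = fuse("global", {"hmm_nap": 2.0, "gmm_jfa": 1.0, "gmm_nap": -1.0})
ti_fused = fuse("ti", {"gmm_jfa": 1.0, "gmm_nap": 3.0})
```

Because the weights are fixed rather than trained, no extra development data is needed for the fusion stage.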
Improvements
Main method: tuning to the WF POT Dev data
- JFA: hard to tune because it needs large amounts of data
- HMM-NAP: already tuned to Dev data
- GMM-NAP: we can tune the UBM and the NAP projection
Methodology
- Research focused on the global condition
- Conclusions have been applied to other conditions
Extended dataset for the global condition
- 6 different 10-digit strings + 2 textual passwords (“At WF my voice is my password”, “There is no place like home”)
- Channel conditions: 75% mismatched trials, 25% matched trials
GMM-Based NAP Improvements: Improved NAP
2-wire NAP
- In [1] we have shown that removal of the speaker-subspace improves accuracy compared to no subspace removal
- In [2] we have shown that removal of dominant components of the speaker-subspace on top of the channel-subspace outperforms standard NAP
- Theoretical motivation was given for speaker-ID in 2-wire data [2], but improvements were also observed for 4-wire speaker-ID
- On the WF data we observe a 6% relative error reduction
[1] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.
[2] Y. A. Solewicz, H. Aronowitz, “Two-Wire Nuisance Attribute Projection”, in Proc. Interspeech, 2009.
GMM-Based NAP Improvements: NAP training
NAP training data
- Baseline: NIST-2004
- IBM 2003 digits dataset: 21% error reduction
- The whole WF Dev-set: 26% error reduction
- Text matched utterances from WF Dev-set: 29% error reduction
Setup
- UBM is trained from NIST04 data
- Same trend is observed when UBM is trained from IBM 2003 digits dataset / WF-POT data
GMM-Based NAP Improvements: UBM training data
UBM training data
- Baseline: NIST-2004
- IBM 2003 digits dataset: 4% error reduction
- Text matched utterances from WF Dev-set: 15% error reduction
Setup
- NAP is trained from text-matched WF-POT data
- Same trend is observed when NAP is trained from NIST04 or IBM 2003 digits dataset
GMM-Based NAP Improvements: Summary
Methods
1. 2-wire NAP
2. Text matched data for UBM training
3. Text matched data for NAP training
Results
- 40% error reduction compared to using NIST dev data for UBM and NAP training
- 25% error reduction compared to using IBM 2003 digits dataset for UBM and NAP training
- These techniques have been successfully used for the speaker, prompted and TI conditions
Results
Results on NIST-2008
Condition: short2-short3, tel-tel, males. Results are in EER (%).

| GMM JFA | GMM NAP |
| --- | --- |
| 1.4 | 3.6 |
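Since the results are reported as equal error rates, a minimal sketch of how an EER can be estimated from trial scores may help. It sweeps a threshold over the observed scores and reports the point where the false-acceptance and false-rejection rates are closest; the function name and the toy scores are hypothetical, and evaluation toolkits typically interpolate more carefully.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return an EER estimate in percent.

    A trial is accepted when its score is >= the threshold. The threshold
    is swept over all observed scores, and the EER is taken as the average
    of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return 100.0 * eer


# Toy example with hypothetical scores: one target trial scores below two impostors.
targets = [0.9, 0.8, 0.7, 0.3]
impostors = [0.6, 0.4, 0.2, 0.1]
eer_percent = equal_error_rate(targets, impostors)
```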
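Results for Single Verification Utterance
Matched channel

| Condition | GMM JFA | GMM NAP | HMM NAP | Fused |
| --- | --- | --- | --- | --- |
| Global | 1.70 | 1.01 | 0.90 | 0.70 |
| Speaker | 2.21 | 1.82 | - | 1.26 |
| Prompted | 6.49 | 5.63 | - | 3.40 |
| TI | 1.24 | 1.35 | - | 0.65 |

Mismatched channel

| Condition | GMM JFA | GMM NAP | HMM NAP | Fused |
| --- | --- | --- | --- | --- |
| Global | 5.07 | 2.99 | 2.35 | 1.95 |
| Speaker | 5.68 | 5.05 | - | 3.64 |
| Prompted | 12.33 | 11.85 | - | 8.33 |
| TI | 4.24 | 4.85 | - | 2.50 |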
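The effect of channel mismatch can be read off the single-utterance results above by dividing each mismatched-channel fused score by its matched-channel counterpart. The short sketch below does this arithmetic on the fused column; the variable names are illustrative.

```python
# Fused results for a single verification utterance: (matched, mismatched),
# copied from the two results tables for this slide.
FUSED = {
    "Global":   (0.70, 1.95),
    "Speaker":  (1.26, 3.64),
    "Prompted": (3.40, 8.33),
    "TI":       (0.65, 2.50),
}

# Ratio of mismatched to matched result per condition.
ratios = {cond: mismatched / matched
          for cond, (matched, mismatched) in FUSED.items()}

for cond, ratio in ratios.items():
    print(f"{cond}: mismatched / matched = {ratio:.2f}")
```

The ratios fall roughly between 2.5 and 3.9, so channel mismatch costs about a factor of three across the four conditions.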
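Results for Two Verification Utterances
Matched channel

| Condition | GMM JFA | GMM NAP | HMM NAP | Fused |
| --- | --- | --- | --- | --- |
| Global | 1.05 | 0.86 | 0.66 | 0.55 |
| Speaker | 1.50 | 1.37 | - | 0.85 |

Mismatched channel

| Condition | GMM JFA | GMM NAP | HMM NAP | Fused |
| --- | --- | --- | --- | --- |
| Global | 3.34 | 1.99 | 1.66 | 1.41 |
| Speaker | 4.11 | 3.97 | - | 2.74 |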
TI Accuracy as Function of Session Length
Enrollment
- Two sessions (1 landline + 1 cellular)
- Enrollment sessions length: ~25 sec each
Post-Evaluation Work & Conclusions
Post Evaluation Work
Error reduction (~20%)
- i-vector based system
- Weighted symmetric scoring*
- Robust scoring*
Handling estimation uncertainty by weighting the contribution of each Gaussian using a geometric mean of the Gaussian occupancy counts. Motivated by [Campbell, 2010].
Goat detection
- Talk given earlier today by Orith Toledo-Ronen
Fast JFA scoring
- Using efficient approximated factors estimation*
* H. Aronowitz, O. Barkan, “New Developments in Joint Factor Analysis for Speaker Verification”, in Proc. Interspeech 2011. Talk will be given today at 4:20 PM
Conclusions
1. We evaluated JFA, GMM-NAP, HMM-NAP and a fused system on 4 authentication conditions
2. HMM-NAP was the best standalone system for the global condition
3. GMM-NAP outperformed JFA on the text-dependent (TD) conditions due to its full usage of the WF POT Dev data. Baseline GMM-NAP was improved by 40% using better Dev data for UBM and NAP-projection estimation and using 2-wire NAP
4. EERs lower than 1% have been obtained for the matched channel condition
5. EER roughly triples for the mismatched channel condition
6. Multi-condition authentication leads to even smaller EERs