speaker authentication - interspeech 2011_v3

H. Aronowitz (IBM Research), Interspeech 2011

Hagai Aronowitz, Ron Hoory (IBM Research – Haifa)
Jason Pelecanos, David Nahamoo (IBM T.J. Watson Research Center)

New Developments in Voice Biometrics for User Authentication


Speaker Verification for Mobile Banking Transactions

Mobile banking services
- “I want to transfer 10K Dollars from my account to account #53463985”

Current solution is based on RSA SecurID

Proposed solution: multi-factor authentication
- Speaker verification
- Face recognition
- …

http://en.wikipedia.org/wiki/File:RSA_SecurID_SID800.jpg


The User Authentication Evaluation

The evaluation focuses on speaker verification

Wells-Fargo (WF) bank collected data from 750 employees

IBM Research participated in the evaluation

Evaluation rules are similar to NIST-SRE rules (however, gender is assumed to be unknown)


Outline

1. Evaluation description

2. Technology: text-independent and text-dependent

3. Improvements

4. Results

5. Post-evaluation work and conclusions


Evaluation Description


Authentication Conditions

1. A global digit string such as 0123456789
- Attackers may use a recording
- Easiest to classify
- Denoted by the global condition

2. A speaker-dependent password such as “4131024773”
- May be eavesdropped / recorded
- We assume the worst-case scenario: the impostor knows the password
- Denoted by the speaker condition

3. A prompted random digit string
- Hardest to accurately authenticate
- Denoted by the prompted condition

4. Free speech
- More natural, especially for a call-center scenario
- Denoted by the TI condition


WF POT Data

- 750 speakers (200 for Dev, 550 for Eval)
- Data recorded over 4 weeks
- 4 sessions recorded per speaker: 2 landline + 2 cellular
- Each session consists of all authentication conditions
- Some digit strings are repeated 3 times to allow enrollment/verification with more than a single repetition

Dev data

Condition            Dev data
Global               Same digit strings as evaluated
Speaker, prompted    Different digit strings than evaluated
TI                   Different text than evaluated


Technology


Speaker Verification Systems

Text-independent systems
- GMM-based Joint Factor Analysis (JFA)
- GMM-based Nuisance Attribute Projection (NAP)
- We use both systems for all authentication conditions

Text-dependent system
- HMM-based NAP
- We use this system for the global condition only


GMM-Based JFA

A standard JFA system: $M = m + Vy + Dz + Ux$
- Hyperparameters (m, V, D, U) estimated from standard telephony data
  - Switchboard-II, NIST 2004 & 2006
  - 12,711 sessions in total
- Front end: VAD + 12 MFCC + 12 Δ + 12 ΔΔ + feature warping
- Linear scoring
- Symmetric scoring: forward + reverse scoring
- ZT-score normalization using WF-POT Dev data
  - 800 sessions (200 speakers × 4 sessions)
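The ZT-score normalization named on this slide can be sketched as Z-norm followed by T-norm. This is a minimal illustration under the usual definition of the two normalizations, not the evaluated implementation; the function names and the layout of the four score matrices are assumptions made for the example.

```python
import numpy as np

def z_norm(scores, scores_vs_z_cohort):
    """Z-norm: standardise each model's scores with that model's own
    impostor statistics, measured on a cohort of Z-norm utterances.

    scores:             (n_models, n_tests)
    scores_vs_z_cohort: (n_models, n_z_utts)
    """
    mu = scores_vs_z_cohort.mean(axis=1, keepdims=True)
    sd = scores_vs_z_cohort.std(axis=1, keepdims=True)
    return (scores - mu) / sd

def zt_norm(raw, model_vs_z, tmodels_vs_test, tmodels_vs_z):
    """ZT-norm: Z-norm first, then T-norm on the Z-normalised scores.

    raw:             (n_models, n_tests)    target models vs. test utterances
    model_vs_z:      (n_models, n_z_utts)   target models vs. Z-cohort utterances
    tmodels_vs_test: (n_tmodels, n_tests)   T-cohort models vs. test utterances
    tmodels_vs_z:    (n_tmodels, n_z_utts)  T-cohort models vs. Z-cohort utterances
    """
    raw_z = z_norm(raw, model_vs_z)
    # The T-cohort model scores are Z-normalised as well, so that both
    # sides of the T-norm live in the same score domain.
    cohort_z = z_norm(tmodels_vs_test, tmodels_vs_z)
    # T-norm: standardise per test utterance using the cohort model scores.
    mu = cohort_z.mean(axis=0, keepdims=True)
    sd = cohort_z.std(axis=0, keepdims=True)
    return (raw_z - mu) / sd
```

In the setup described above, both cohorts would be drawn from the 800 WF-POT Dev sessions.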


GMM-Based NAP Baseline

- UBM & NAP are trained from NIST 2004
- Supervectors created using normalized GMM-means
- Front end: 13 MFCC + 13 Δ + VAD + feature warping
- Dot product scoring
- ZT-score normalization using same data as the JFA system
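NAP followed by dot-product scoring can be sketched as below. The sketch assumes the nuisance subspace is taken as the leading directions of within-speaker supervector variation, which is the common formulation; the function names and the `rank` parameter are illustrative and not taken from the evaluated system.

```python
import numpy as np

def train_nap(supervectors, speaker_ids, rank):
    """Estimate an orthonormal basis U (dim, rank) of the nuisance subspace
    from supervectors labelled by speaker."""
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    W = X.copy()
    for s in np.unique(ids):
        mask = ids == s
        # Remove each speaker's mean so only session variation remains.
        W[mask] -= X[mask].mean(axis=0)
    # Right-singular vectors of the deviations are the eigenvectors of the
    # within-speaker scatter matrix, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return vt[:rank].T

def apply_nap(x, U):
    """Project x onto the complement of the nuisance subspace: (I - U U^T) x."""
    return x - U @ (U.T @ x)

def nap_score(enroll_sv, test_sv, U):
    """Dot-product score between two NAP-compensated supervectors."""
    return float(apply_nap(enroll_sv, U) @ apply_nap(test_sv, U))
```

Raw scores from `nap_score` would then go through score normalization before thresholding.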


HMM-Based NAP

- Speaker-independent (SI) HMM training
  - Using text-matched Dev data (200 speakers × 4 sessions)
- MAP adaptation of session-dependent HMMs
  - 3 repetitions used for enrollment
  - 1 or 2 repetitions used for verification
- Supervectors created using normalized GMM-means of the HMMs
- Front end, NAP, scoring and score normalization: same as for the GMM-NAP system
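The MAP adaptation and supervector steps can be illustrated for a single GMM (for an HMM, one such GMM per state, with the per-state supervectors concatenated). This is a sketch under assumptions the slides do not state: relevance-MAP adaptation of the means only, diagonal covariances, and normalization of each mean by the square root of its mixture weight over its standard deviation. The relevance factor of 16 is a conventional default, not a value from the talk.

```python
import numpy as np

def map_adapt_means(frames, posteriors, prior_means, relevance=16.0):
    """Relevance-MAP adaptation of Gaussian means.

    frames:      (T, D) feature vectors of one session
    posteriors:  (T, G) per-frame Gaussian occupation probabilities
    prior_means: (G, D) means of the speaker-independent model
    """
    n = posteriors.sum(axis=0)                      # (G,) occupancy counts
    first = posteriors.T @ frames                   # (G, D) first-order statistics
    ml_means = first / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]          # data-dependent interpolation
    # Gaussians with little data stay close to the prior means.
    return alpha * ml_means + (1.0 - alpha) * prior_means

def normalized_supervector(means, weights, variances):
    """Stack means scaled by sqrt(weight) / std-dev into one vector.

    means, variances: (G, D); weights: (G,)
    """
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(variances)
    return scaled.ravel()
```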


Linear Fusion

Global condition
- HMM-NAP score – 50%
- GMM-JFA score – 25%
- GMM-NAP score – 25%

Speaker, prompted & TI conditions
- GMM-JFA score – 50%
- GMM-NAP score – 50%
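The fusion rule above is a fixed weighted sum per condition. A minimal sketch follows, assuming each system's score has already been normalized so the three are on comparable scales; the dictionary keys and function name are illustrative.

```python
# Fusion weights as listed on the slide.
GLOBAL_WEIGHTS = {"hmm_nap": 0.50, "gmm_jfa": 0.25, "gmm_nap": 0.25}
OTHER_WEIGHTS = {"gmm_jfa": 0.50, "gmm_nap": 0.50}

def fuse(scores, condition):
    """Linearly fuse per-system scores for one trial.

    scores:    dict mapping system name to its normalised score
    condition: "global", "speaker", "prompted" or "ti"
    """
    weights = GLOBAL_WEIGHTS if condition == "global" else OTHER_WEIGHTS
    return sum(w * scores[name] for name, w in weights.items())
```

For example, `fuse({"hmm_nap": 2.0, "gmm_jfa": 1.0, "gmm_nap": 3.0}, "global")` weights the HMM-NAP score twice as heavily as either GMM system.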


Improvements


Improvements

Main method – tuning to the WF POT Dev data
- JFA – hard to tune because it needs large amounts of data
- HMM-NAP – already tuned to Dev data
- GMM-NAP – we can tune the UBM and the NAP projection

Methodology
- Research focused on the global condition
- Conclusions have been applied to other conditions

Extended dataset for the global condition
- 6 different 10-digit strings + 2 textual passwords (“At WF my voice is my password”, “There is no place like home”)
- Channel conditions:
  - 75% mismatched trials
  - 25% matched trials


GMM-Based NAP Improvements: Improved NAP

2-wire NAP
- In [1] we have shown that removal of the speaker subspace improves accuracy compared to no subspace removal
- In [2] we have shown that removal of dominant components of the speaker subspace on top of the channel subspace outperforms standard NAP
- Theoretical motivation was given for speaker-ID in 2-wire data [2], but improvements were also observed for 4-wire speaker-ID
- On the WF data we observe a 6% relative error reduction

[1] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.
[2] Y. A. Solewicz, H. Aronowitz, “Two-Wire Nuisance Attribute Projection”, in Proc. Interspeech, 2009.


GMM-Based NAP Improvements: NAP training

NAP training data
Baseline: NIST-2004
- IBM 2003 digits dataset: 21% error reduction
- The whole WF Dev-set: 26% error reduction
- Text-matched utterances from WF Dev-set: 29% error reduction

Setup
- UBM is trained from NIST04 data
- Same trend is observed when UBM is trained from IBM 2003 digits dataset / WF-POT data


GMM-Based NAP Improvements: UBM training data

UBM training data
Baseline: NIST-2004
- IBM 2003 digits dataset: 4% error reduction
- Text-matched utterances from WF Dev-set: 15% error reduction

Setup
- NAP is trained from text-matched WF-POT data
- Same trend is observed when NAP is trained from NIST04 or IBM 2003 digits dataset


GMM-Based NAP Improvements: Summary

Methods
1. 2-wire NAP
2. Text-matched data for UBM training
3. Text-matched data for NAP training

Results
- 40% error reduction compared to using NIST dev data for UBM and NAP training
- 25% error reduction compared to using IBM 2003 digits dataset for UBM and NAP training
- These techniques have been successfully used for the speaker, prompted and TI conditions


Results


Results on NIST-2008

Condition: short2-short3, tel-tel, males. Results are in EER (%).

GMM JFA    GMM NAP
1.4        3.6
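All results in this section are equal error rates. For reference, an EER can be computed from lists of target and impostor trial scores roughly as follows. This is a simple threshold sweep without interpolation, assuming a trial is accepted when its score is at or above the threshold; the function name is illustrative.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return the EER in percent.

    Sweeps every observed score as a decision threshold and reports the
    operating point where false-acceptance and false-rejection rates are
    closest, averaging the two rates at that point.
    """
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.sort(np.asarray(impostor_scores, dtype=float))
    thresholds = np.unique(np.concatenate([tgt, imp]))
    # False rejection: target trials scoring strictly below the threshold.
    frr = np.searchsorted(tgt, thresholds, side="left") / tgt.size
    # False acceptance: impostor trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(imp, thresholds, side="left") / imp.size
    i = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[i] + frr[i]) / 2.0
```

With few trials the sweep is coarse; evaluation toolkits typically interpolate between neighbouring operating points.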


Results for Single Verification Utterance

Matched channel, EER (%)

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      1.70      1.01      0.90      0.70
Speaker     2.21      1.82      -         1.26
Prompted    6.49      5.63      -         3.40
TI          1.24      1.35      -         0.65

Mismatched channel, EER (%)

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      5.07      2.99      2.35      1.95
Speaker     5.68      5.05      -         3.64
Prompted    12.33     11.85     -         8.33
TI          4.24      4.85      -         2.50
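The effect of channel mismatch and the gain from fusion can be read off these numbers with a little arithmetic. The snippet below uses only the fused single-utterance EERs from the two tables; the variable and function names are illustrative.

```python
# Fused EERs (%) for a single verification utterance, from the tables above.
matched = {"global": 0.70, "speaker": 1.26, "prompted": 3.40, "ti": 0.65}
mismatched = {"global": 1.95, "speaker": 3.64, "prompted": 8.33, "ti": 2.50}

# How many times larger the EER is under channel mismatch, per condition.
ratio = {cond: mismatched[cond] / matched[cond] for cond in matched}

def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

# Fusion vs. the best standalone system (HMM-NAP) on the matched global condition.
global_fusion_gain = relative_reduction(0.90, 0.70)

for cond, r in ratio.items():
    print(f"{cond}: mismatched EER is {r:.2f}x the matched EER")
print(f"fusion gain on matched global condition: {global_fusion_gain:.1f}% relative")
```

The ratios fall between roughly 2.5 and 3.8 depending on the condition.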


Results for Two Verification Utterances

Matched channel, EER (%)

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      1.05      0.86      0.66      0.55
Speaker     1.50      1.37      -         0.85

Mismatched channel, EER (%)

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      3.34      1.99      1.66      1.41
Speaker     4.11      3.97      -         2.74


TI Accuracy as Function of Session Length

Enrollment
- Two sessions (1 landline + 1 cellular)
- Enrollment session length: ~25 sec each


Post-Evaluation Work & Conclusions


Post-Evaluation Work

Error reduction (~20%)
- i-vector based system
- Weighted symmetric scoring*
- Robust scoring*

Handling estimation uncertainty by weighting the contribution of each Gaussian using a geometric mean of the Gaussian occupancy counts. Motivated by [Campbell, 2010].

Goat detection
- Talk given earlier today by Orith Toledo-Ronen

Fast JFA scoring
- Using efficient approximated factors estimation*

* H. Aronowitz, O. Barkan, “New Developments in Joint Factor Analysis for Speaker Verification”, in Proc. Interspeech 2011. Talk will be given today at 4:20 PM.


Conclusions

1. We evaluated JFA, GMM-NAP, HMM-NAP and a fused system on 4 authentication conditions

2. HMM-NAP was the best standalone system for the global condition

3. GMM-NAP outperformed JFA on the TD conditions due to its full usage of the WF POT Dev data
- Baseline GMM-NAP was improved by 40% using better Dev data for UBM and NAP-projection estimation and using 2-wire NAP

4. EERs lower than 1% have been obtained for the matched channel condition

5. EER triples for the mismatched channel condition

6. Multi-condition authentication leads to even smaller EERs
