SPEAKER RECOGNITION

Scott Settembre
[email protected]
CSE 734 : Cyber Physical Spaces

Overview

• Speaker Identification
• Speaker Validation
• Two types of Recognition methods
  – Text dependent vs. Text independent
• Speaker Recognition steps
• Conclusion / References

March 16, 2009


Speaker Identification

• Determines the speaker from a set of registered speakers
  – This is called a “closed” set identification
  – Result is the best speaker matched
• What if the speaker is not in the database?
  – This is called an “open” set identification
  – Result can be a speaker or a no-match result
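The closed-set and open-set cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speaker templates are made-up feature vectors, cosine similarity stands in for whatever similarity measure a real system uses, and `min_score` is a hypothetical cut-off that turns a closed-set search into an open-set one.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity in [-1, 1]; 1 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(features, database, min_score=None):
    """Return (best_speaker, score).

    Closed set (min_score is None): always return the best match.
    Open set: return (None, score) when even the best match is too weak.
    """
    scores = {name: cosine_similarity(features, template)
              for name, template in database.items()}
    best = max(scores, key=scores.get)
    if min_score is not None and scores[best] < min_score:
        return None, scores[best]
    return best, scores[best]

# Hypothetical enrolled templates (real ones would come from feature extraction).
database = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}

known = np.array([0.9, 0.1, 0.0])      # resembles alice's template
stranger = np.array([0.0, 0.1, 1.0])   # resembles nobody in the database

closed_result = identify(stranger, database)                # forced to pick someone
open_result = identify(stranger, database, min_score=0.8)   # may answer "no match"
known_result = identify(known, database, min_score=0.8)
```

In the closed-set call the stranger is still assigned to the nearest registered speaker, which is why an open-set system needs the extra no-match outcome.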



Speaker Identification Diagram

[Diagram: Actual Speaker → Input → Normalization → Feature Extraction → Calculate similarity to each speaker template or model → Select best match → Identification of Speaker. Enrollment fills the Speaker Database that supplies the templates or models.]




Speaker Validation

• Also called “Verification” or “Authentication”
• Determines if the voice matches a particular registered speaker
  – Result is the probability of a match or a similarity measure
• Similarity must exceed a particular threshold
  – Higher threshold produces more false negatives
  – Lower threshold produces more false positives
  – Voice variability and security issues make this a difficult threshold value to determine (more later)
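The threshold trade-off can be made concrete with a small Python sketch. The similarity scores below are invented for illustration, and `error_rates` is a hypothetical helper rather than part of any particular verification system.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_negative_rate, false_positive_rate) at a threshold.

    A trial is accepted when its similarity is >= threshold.
    """
    false_negatives = sum(s < threshold for s in genuine_scores)
    false_positives = sum(s >= threshold for s in impostor_scores)
    return (false_negatives / len(genuine_scores),
            false_positives / len(impostor_scores))

# Made-up similarity scores, for illustration only.
genuine_scores = [0.91, 0.84, 0.78, 0.66, 0.95]    # the true speaker's attempts
impostor_scores = [0.35, 0.52, 0.70, 0.41, 0.60]   # other people's attempts

for threshold in (0.5, 0.65, 0.8):
    fnr, fpr = error_rates(genuine_scores, impostor_scores, threshold)
    print(f"threshold={threshold:.2f}  "
          f"false negatives={fnr:.0%}  false positives={fpr:.0%}")
```

With these made-up scores, moving the threshold from 0.5 to 0.8 drops false positives from 60% to 0% while false negatives rise from 0% to 40%.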


Speaker Validation Diagram

[Diagram: Actual Speaker → Input → Normalization → Feature Extraction → Calculate similarity to given template or model → Does similarity exceed threshold? → Verification (Accept/Reject). The Speaker ID selects the speaker template or model from the Speaker Database, which is filled during Enrollment.]




Recognition Methods

• Text Dependent
  – Requires user to speak text spoken at enrollment
  – Usually a name, password, or phrase
  – Text Prompting is used to combat deception
    • The system requires the user to repeat back a random phrase or list of numbers
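Text prompting can be sketched as follows. This assumes a separate speech recognizer (not shown) that turns the user's reply into text; `make_prompt` and `prompt_matches` are hypothetical helper names, and the speaker check itself would still run on the same audio.

```python
import secrets

def make_prompt(num_digits=6):
    """Return a random digit string for the user to read aloud."""
    return " ".join(str(secrets.randbelow(10)) for _ in range(num_digits))

def prompt_matches(prompt, recognized_text):
    """Check that the recognized words are exactly the prompted digits."""
    return recognized_text.split() == prompt.split()

prompt = make_prompt()
print("Please say:", prompt)
```

Because the digits change on every attempt, a recording of an earlier session no longer contains the right text.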

• Video example from “CSAIL” - Spoken Language Systems group at MIT.



Recognition Methods, cont.

• Text Independent
  – Non-invasive, does not require user to actively answer prompts
  – Longer enrollment phase required, more training data needed
  – Focuses on a subset of audio/phonetic features

• Video example from Nathan Harrington at IBM developerWorks.





Speaker Recognition Steps

1. Input Speech
2. Normalize captured speech
3. Feature extraction
4. Similarity matching
5. Decision/Threshold



Step 1. Input Speech

• Various fidelity from inputs
  – Telephone, computer microphone, noise-cancelling headset, dedicated capture microphone, room microphones
• Noise
  – Background noise, room echoes
• Variability in voice
  – Speaking manner (rate and volume), sickness, aging, emotions, morning vs. evening voice



Step 2. Normalize Captured Speech

• Intersession variability and variability over time cause speech features to fluctuate
• Use of “filter bank” is common
• Normalization helps remove these variations, but at a price
  – Parameter-Domain normalization
  – Distance/Similarity-Domain normalization



Step 2.a. Normalization Techniques

• Parameter-Domain normalization
  – Spectral equalization (i.e. signal processing)
    • Dampens large variations in features by averaging over time, useful for long utterances
    • Removes some speaker specific features
• Distance/Similarity-Domain normalization
  – Various techniques that use probabilities of known speakers that have already been enrolled
    • Useful if you are doing validation
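Both families of normalization can be sketched in Python. The choices here are assumptions, not the slides' prescribed methods: subtracting the per-feature average over time is one common form of spectral equalization (it presumes log-spectral or cepstral features, where a channel effect is roughly an additive offset), and `cohort_normalize` is a hypothetical helper that scores the claimed speaker relative to other enrolled speakers.

```python
import numpy as np

def mean_normalize(frames):
    """Parameter-domain: subtract the per-feature average over time.

    frames has shape (num_frames, num_features).
    """
    frames = np.asarray(frames, dtype=float)
    return frames - frames.mean(axis=0)

def cohort_normalize(claimed_score, cohort_scores):
    """Similarity-domain: score relative to other enrolled speakers."""
    return claimed_score - float(np.mean(cohort_scores))

# A constant offset (e.g. a different microphone) added to every frame...
clean = np.array([[1.0, 2.0], [3.0, 6.0], [2.0, 4.0]])
shifted = clean + np.array([5.0, -1.0])

# ...disappears after mean normalization.
same_after = np.allclose(mean_normalize(clean), mean_normalize(shifted))

# Made-up scores: the claimed speaker versus three other enrolled speakers.
relative = cohort_normalize(0.82, [0.40, 0.55, 0.46])
```

The price is visible in `mean_normalize`: the average that gets subtracted also carries information about the speaker, so some speaker-specific detail is lost along with the channel offset.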



Step 3. Feature Extraction

• The input utterance is converted to a set of feature vectors

• Time alignment may need to be done

• Calculate similarity between each captured vector and the registered speaker template or model

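[Figure: the utterance “Hello” is split into overlapping segments (h, he, e, el, l, lo, o), for both the input and the registered template. Matching segments are compared, e.g. h vs. h: .90 similarity, he vs. he: .60 similarity, giving .75 overall.]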
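The idea of overlapping segments, one feature vector per segment, and an averaged similarity can be sketched in Python. Everything specific here is an assumption made for illustration: the frame length and hop, the log-magnitude-spectrum features, cosine similarity, and the synthetic sine-wave "voices". Truncating to the shorter sequence is only a crude stand-in for real time alignment.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] for s in starts])

def extract_features(signal, frame_len=400, hop=200):
    """One feature vector per frame: log(1 + magnitude spectrum)."""
    frames = frame_signal(signal, frame_len, hop) * np.hanning(frame_len)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

def overall_similarity(features, template):
    """Average the per-frame cosine similarities."""
    n = min(len(features), len(template))   # crude stand-in for time alignment
    a, b = features[:n], template[:n]
    per_frame = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(per_frame.mean())

# Synthetic one-second signals standing in for recorded speech.
rate = 8000
t = np.arange(rate) / rate
enrolled = np.sin(2 * np.pi * 220 * t)
same_voice = np.sin(2 * np.pi * 220 * t + 0.3)   # same pitch, shifted phase
other_voice = np.sin(2 * np.pi * 700 * t)        # different pitch

template = extract_features(enrolled)
score_same = overall_similarity(extract_features(same_voice), template)
score_other = overall_similarity(extract_features(other_voice), template)
```

As in the figure, each frame gets its own similarity and the overall score is their mean; a real system would replace the truncation with proper alignment, such as dynamic time warping.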


Side note : Analyzing speech “ah”


Waveform (Raw acoustic data)

Spectrograph (Frequency vs. Amplitude)

Formant (Continuous peak that crosses frequencies)

Image attributed to Dr. Douglas Roland from lecture notes describing speech recognition.


Step 4. Similarity Matching

• Other pattern classification techniques can be used on the normalized input

• Each speaker gets his/her own HMM, neural network, VQ codebook, etc.

• Another approach is to target specific phonemes or features
  – Example showing the targeting of vowel sounds, in particular the syllable “ah”
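As one example of a per-speaker model, a VQ codebook can be sketched in Python. This is a minimal illustration under stated assumptions: plain k-means builds each codebook, the 2-D "feature vectors" are random synthetic data, and `closest_speaker` is a hypothetical helper that picks the codebook with the lowest average distortion.

```python
import numpy as np

def train_codebook(features, num_codes=4, iterations=10, seed=0):
    """Build a VQ codebook with a few rounds of plain k-means."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(features), size=num_codes, replace=False)
    codebook = features[picks].copy()
    for _ in range(iterations):
        # Assign every vector to its nearest code word.
        dists = np.linalg.norm(
            features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Move each code word to the mean of the vectors assigned to it.
        for k in range(num_codes):
            members = features[nearest == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def distortion(features, codebook):
    """Average distance from each vector to its nearest code word."""
    dists = np.linalg.norm(
        features[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def closest_speaker(features, codebooks):
    """Pick the speaker whose codebook gives the lowest distortion."""
    return min(codebooks, key=lambda name: distortion(features, codebooks[name]))

# Synthetic 2-D feature vectors standing in for real speech features.
rng = np.random.default_rng(1)
codebooks = {
    "alice": train_codebook(rng.normal(0.0, 1.0, (200, 2))),
    "bob": train_codebook(rng.normal(6.0, 1.0, (200, 2))),
}
test_utterance = rng.normal(6.0, 1.0, (50, 2))   # drawn like bob's data
match = closest_speaker(test_utterance, codebooks)
```

Here a lower distortion plays the role of a higher similarity score; an HMM or neural network per speaker would slot into the same place with its own scoring function.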



Example of Vowel Comparisons


Charts attributed to Pasich, C. Speaker Identification MATLAB files, Connexions Web site. http://cnx.org/content/m14201/1.3/, Feb 16, 2007.


Step 5. Decision/Threshold

• For speaker identification, simply take the registered speaker template with the highest similarity score

• For speaker verification, there needs to be a minimum acceptable similarity score





Conclusion : Why care?

• Speaker recognition will become ubiquitous
  – Cell phone applications – banking, security, logins
  – Forensic analysis (voiceprints)
  – Home automation (know thy user)
  – Google “speaker” search? (You know it’s going to happen!)



References

• Video links
  – MIT, CSAIL. http://www.youtube.com/watch?v=0ec1Gtnlq1k
  – IBM, developerWorks. http://www.youtube.com/watch?v=JJ_YzBaqzAo

• Cole, Ronald A., Editor (1996) Survey of the State of the Art in Human Language Technology. http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html

• Iyer, Manjunath Ramachandra (2007). “Differentially Fed Artificial Neural Networks for Speech Signal Prediction.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp. 309-323) Hershey, PA : Idea Group Pub., c2007.

• Lung, Shung-Yung (2007). “Speaker Recognition.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp. 371-407) Hershey, PA : Idea Group Pub., c2007.
