
Page 1: Scott Settembre ss424@cse.buffalo.edu CSE 734 : Cyber Physical Spaces

SPEAKER RECOGNITION

Scott Settembre

ss424@cse.buffalo.edu

CSE 734 : Cyber Physical Spaces


Overview

• Speaker Identification
• Speaker Validation
• Two types of Recognition methods
  – Text dependent vs. Text independent
• Speaker Recognition steps
• Conclusion / References

March 16, 2009


Speaker Identification

• Determines the speaker from a set of registered speakers
  – This is called “closed” set identification
  – The result is the best-matching speaker
• What if the speaker is not in the database?
  – This is called “open” set identification
  – The result can be a speaker or a no-match result
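The difference between closed and open set identification can be sketched in a few lines. This is only an illustration: the speaker names, the similarity scores, and the `min_score` cutoff are made-up values, not taken from the slides.

```python
from typing import Dict, Optional


def identify_closed_set(scores: Dict[str, float]) -> str:
    """Closed set: always return the registered speaker with the highest score."""
    return max(scores, key=scores.get)


def identify_open_set(scores: Dict[str, float], min_score: float) -> Optional[str]:
    """Open set: return the best speaker, or None (no match) when even the
    best score falls below the hypothetical min_score cutoff."""
    best = identify_closed_set(scores)
    return best if scores[best] >= min_score else None


# Hypothetical similarity of one utterance to three enrolled speakers.
scores = {"alice": 0.62, "bob": 0.48, "carol": 0.55}
closed_result = identify_closed_set(scores)
open_result = identify_open_set(scores, min_score=0.70)
```

With these numbers the closed set version still names a speaker, while the open set version reports no match because the best score is below the cutoff.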


Speaker Identification Diagram

[Diagram: Actual Speaker → Input → Normalization → Feature Extraction → Calculate similarity to each speaker template or model → Select best match → Identification of Speaker. Enrollment populates the Speaker Database, which supplies the templates or models.]


Speaker Validation

• Also called “Verification” or “Authentication”
• Determines if the voice matches a particular registered speaker
  – Result is the probability of a match or a similarity measure
• Similarity must exceed a particular threshold
  – Higher threshold produces more false negatives
  – Lower threshold produces more false positives
  – Voice variability and security issues make this a difficult threshold value to determine (more later)
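The threshold trade-off above can be made concrete with a small sketch. The score lists are invented for illustration, and the rule "accept when similarity is at least the threshold" is an assumption about how the comparison is done.

```python
from typing import List, Tuple


def error_rates(genuine: List[float], impostor: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) at one threshold.
    A trial is accepted when its similarity is >= threshold."""
    false_neg = sum(1 for s in genuine if s < threshold) / len(genuine)
    false_pos = sum(1 for s in impostor if s >= threshold) / len(impostor)
    return false_neg, false_pos


# Hypothetical scores: true speaker trials vs. impostor trials.
genuine_scores = [0.91, 0.84, 0.77, 0.69, 0.88]
impostor_scores = [0.35, 0.52, 0.71, 0.44, 0.60]

for threshold in (0.50, 0.70, 0.90):
    fnr, fpr = error_rates(genuine_scores, impostor_scores, threshold)
    print(f"threshold={threshold:.2f}  false negatives={fnr:.0%}  false positives={fpr:.0%}")
```

Raising the threshold rejects more true speakers, lowering it accepts more impostors, so the chosen value depends on which error is more costly for the application.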


Speaker Validation Diagram

[Diagram: Actual Speaker → Input → Normalization → Feature Extraction → Calculate similarity to given template or model → Does similarity exceed threshold? → Verification (Accept/Reject). The Speaker ID selects the speaker template or model from the Speaker Database, which is populated at Enrollment.]


Recognition Methods

• Text Dependent
  – Requires the user to speak the text spoken at enrollment
  – Usually a name, password, or phrase
  – Text Prompting is used to combat deception
    • The system requires the user to repeat back a random phrase or list of numbers

• Video example from “CSAIL” - Spoken Language Systems group at MIT.
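The text prompting idea above can be sketched as follows. This is a minimal illustration, assuming the spoken reply has already been transcribed by a separate speech recognizer; `make_prompt` and `prompt_matches` are hypothetical helper names.

```python
import secrets


def make_prompt(num_digits: int = 6) -> str:
    """Build a fresh random digit sequence for the speaker to repeat back."""
    return " ".join(str(secrets.randbelow(10)) for _ in range(num_digits))


def prompt_matches(prompt: str, recognized_text: str) -> bool:
    """Check that the transcript of the reply matches the prompt.
    The transcript is assumed to come from a separate speech recognizer."""
    return prompt.split() == recognized_text.split()


prompt = make_prompt()
```

Because the sequence changes on every attempt, a recording of an earlier session is unlikely to contain the requested digits. The voice itself still has to pass the speaker check.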


Recognition Methods, cont.

• Text Independent
  – Non-invasive, does not require user to actively answer prompts
  – Longer enrollment phase required, more training data needed
  – Focuses on a subset of audio/phonetic features

• Video example from Nathan Harrington at IBM developerWorks.


Speaker Recognition Steps

1. Input Speech
2. Normalize captured speech
3. Feature extraction
4. Similarity matching
5. Decision/Threshold


Step 1. Input Speech

• Various fidelity from inputs
  – Telephone, computer microphone, noise cancelling headset, dedicated capture microphone, room microphones
• Noise
  – Background noise, room echoes
• Variability in voice
  – Speaking manner (rate and volume), sickness, aging, emotions, morning vs. evening voice


Step 2. Normalize Captured Speech

• Intersession variability and variability over time cause speech features to fluctuate
• Use of “filter bank” is common
• Normalization helps remove these variations, but at a price
  – Parameter-Domain normalization
  – Distance/Similarity-Domain normalization


Step 2.a. Normalization Techniques

• Parameter-Domain normalization
  – Spectral equalization (i.e. signal processing)
    • Dampens large variations in features by averaging over time, useful for long utterances
    • Removes some speaker specific features
• Distance/Similarity-Domain normalization
  – Various techniques that use probabilities of known speakers that have already been enrolled
    • Useful if you are doing validation
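Both normalization families can be illustrated with a short sketch. It assumes one common form of each: subtracting the per-utterance average from every feature frame (parameter domain), and scoring the claimed speaker relative to the average score of other enrolled "cohort" speakers (distance/similarity domain). The slides do not name a specific technique, so the function names and numbers here are illustrative only.

```python
from statistics import mean
from typing import List

Frame = List[float]


def subtract_utterance_mean(frames: List[Frame]) -> List[Frame]:
    """Parameter domain: subtract each feature's average over the whole
    utterance from every frame, removing slowly varying offsets."""
    dims = len(frames[0])
    means = [mean(f[d] for f in frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def cohort_normalized_score(claimed_score: float, cohort_scores: List[float]) -> float:
    """Distance/similarity domain: score of the claimed speaker relative to
    the average score of other enrolled (cohort) speakers."""
    return claimed_score - mean(cohort_scores)


# Hypothetical two-dimensional feature frames and similarity scores.
frames = [[1.0, 4.0], [3.0, 6.0], [2.0, 5.0]]
normalized = subtract_utterance_mean(frames)  # each feature now averages to zero
relative = cohort_normalized_score(0.80, [0.50, 0.60, 0.40])
```

The averaging needs enough frames to be reliable, which is why it suits long utterances, and anything constant in the speaker's voice is subtracted along with the channel offset.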


Step 3. Feature Extraction

• The input utterance is converted to a set of feature vectors
• Time alignment may need to be done
• Calculate similarity between each captured vector and the registered speaker template or model

[Figure: the utterance “Hello” split into overlapping segments (h, he, e, el, l, lo, o) and compared segment by segment with a template. Example: “h” vs. “h” gives .90 similarity and “he” vs. “he” gives .60 similarity, for .75 overall.]
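A minimal sketch of time alignment followed by an overall similarity score is shown below. It assumes dynamic time warping for the alignment and cosine similarity between feature vectors. Neither is named on the slide, and the two-dimensional vectors are toy values.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align(captured: List[Vector], template: List[Vector]) -> List[Tuple[int, int]]:
    """Dynamic time warping: find the frame pairing with the lowest total
    cost (1 - similarity), letting either sequence stretch in time."""
    n, m = len(captured), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 1.0 - cosine_similarity(captured[i - 1], template[j - 1])
            cost[i][j] = step + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Walk back from the end to recover the aligned frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return path[::-1]


def overall_similarity(captured: List[Vector], template: List[Vector]) -> float:
    """Average the per-pair similarities along the aligned path."""
    path = align(captured, template)
    return sum(cosine_similarity(captured[i], template[j]) for i, j in path) / len(path)


# The captured utterance holds its first sound one frame longer than the template.
template = [[1.0, 0.0], [0.0, 1.0]]
captured = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
pairs = align(captured, template)
score = overall_similarity(captured, template)

# The slide's arithmetic: segment scores of .90 and .60 average to .75.
slide_overall = sum([0.90, 0.60]) / 2
```

Averaging the per-segment scores mirrors the slide's example. A real system might weight segments differently.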


Side note : Analyzing speech “ah”

[Figure: three views of the sound “ah”: Waveform (raw acoustic data); Spectrograph (frequency vs. amplitude); Formant (continuous peak that crosses frequencies).]

Image attributed to Dr. Douglas Roland from lecture notes describing speech recognition.


Step 4. Similarity Matching

• Other pattern classification techniques can be used on the normalized input
• Each speaker gets his/her own HMM, neural network, VQ codebook, etc.
• Another approach is to target specific phonemes or features
  – Example showing the targeting of vowel sounds, in particular the syllable “ah”
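As one illustration of the per-speaker model idea above, here is a sketch of matching with VQ codebooks. The codebooks are hand-written toy values (in practice they would be learned from enrollment data, for example by clustering), and the speaker names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def distortion(frames: List[Vector], codebook: List[Vector]) -> float:
    """Average distance from each frame to its nearest codeword."""
    return sum(min(math.dist(f, c) for c in codebook) for f in frames) / len(frames)


def best_speaker(frames: List[Vector], codebooks: Dict[str, List[Vector]]) -> str:
    """Pick the speaker whose codebook represents the frames with least distortion."""
    return min(codebooks, key=lambda name: distortion(frames, codebooks[name]))


# Hypothetical codebooks, one per enrolled speaker.
codebooks = {
    "alice": [[1.0, 1.0], [4.0, 4.0]],
    "bob": [[8.0, 1.0], [9.0, 5.0]],
}
frames = [[1.0, 2.0], [4.0, 3.0]]  # feature vectors from the captured utterance
match = best_speaker(frames, codebooks)
```

Here a lower distortion means a closer match, so the similarity score is effectively the negated distortion.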


Example of Vowel Comparisons


Charts attributed to Pasich, C. Speaker Identification MATLAB files, Connexions Web site. http://cnx.org/content/m14201/1.3/, Feb 16, 2007.


Step 5. Decision/Threshold

• For speaker identification, simply take the registered speaker template with the highest similarity score

• For speaker verification, there needs to be a minimum acceptable similarity score


Conclusion : Why care?

• Speaker recognition will become ubiquitous
  – Cell phone applications: banking, security, logins
  – Forensic analysis (voiceprints)
  – Home automation (know thy user)
  – Google “speaker” search? (You know it’s going to happen!)


References

• Video links
  – MIT, CSAIL. http://www.youtube.com/watch?v=0ec1Gtnlq1k
  – IBM, developerWorks. http://www.youtube.com/watch?v=JJ_YzBaqzAo
• Cole, Ronald A., Editor (1996). Survey of the State of the Art in Human Language Technology. http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
• Iyer, Manjunath Ramachandra (2007). “Differentially Fed Artificial Neural Networks for Speech Signal Prediction.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp. 309-323). Hershey, PA : Idea Group Pub., c2007.
• Lung, Shung-Yung (2007). “Speaker Recognition.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp. 371-407). Hershey, PA : Idea Group Pub., c2007.
