Automatic Speaker Recognition System Using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) Approach
Presented by:
Md. Abdullah-al-MAMUN
OUTLINE
• What is speaker recognition?
  – Speaker Identification
  – Speaker Verification
• The Structure of Speaker Recognizer
• Feature Extraction: MFCC
• Speech Signal to Vector Quantization (VQ)
• Database Creation Process
• Speaker Identification
• Speaker Verification
• Table: Speaker Recognition Result
• Applications
• Conclusion
• References
What is Speaker Recognition?
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech signals.
Speaker Recognition = Speaker Identification + Speaker Verification
Speaker Identification
Whose voice is this?
Speaker Verification
• Synonyms: authentication, detection.
• User claims an identity.
• System task: accept or reject the identity claim.
Is this Ahmad’s voice?
Model of Speaker Recognizer
Fig. 1: Simple model of a speaker recognizer.
[Figure labels: "Hello, Mr. John"; "U Permitted to Access"]
The Structure of Speaker Recognizer
Figure 2: Functional scheme of an ASR system.
[Figure blocks: Speech Signal Analysis and Feature Extraction produce a Feature Vector. In Training Mode the vector goes to Speaker Modeling; in Recognition it goes to Classification and Decision Logic, which outputs the Speaker #ID (e.g. Speaker_1).]
Feature Extraction
- The aim is to extract the voice features that distinguish different phonemes of a language.
MFCC extraction
Speech signal x(n) → Pre-emphasis → x'(n) → Window → x_t(n) → DFT → X_t(k) → Mel filter banks → Y_t(m) → Log(|·|^2) → IDFT → MFCC y_t^(m)(k)
MFCC stands for Mel-frequency cepstral coefficients, a representation of the short-term power spectrum of a sound used in audio processing.
The MFCCs are the amplitudes of the spectrum that results from the final transform (the IDFT step) applied to the log Mel filter-bank outputs.
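The MFCC extraction chain can be sketched in plain Python as follows. This is a minimal illustration, not the presenter's implementation. The pre-emphasis factor (0.97), frame length (256), hop (128), number of Mel filters (20) and number of coefficients (12) are assumed values. The squared magnitude is taken before the Mel filter bank, which is a common ordering, and a DCT-II stands in for the IDFT step, which is a common substitution because the log spectrum is real.

```python
import cmath
import math

def pre_emphasis(x, alpha=0.97):
    # x'(n) = x(n) - alpha * x(n-1); alpha = 0.97 is an assumed typical value
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(frame):
    # Taper the frame edges to reduce spectral leakage, giving x_t(n)
    N = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n, s in enumerate(frame)]

def power_spectrum(frame):
    # |X_t(k)|^2 for k = 0..N/2, using a direct O(N^2) DFT for clarity
    N = len(frame)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * n / N)
                    for n, s in enumerate(frame))) ** 2
            for k in range(N // 2 + 1)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with centres spaced evenly on the Mel scale
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / fs)) for f in edges]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):        # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):        # falling slope, peak of 1.0 at mid
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def mfcc_frame(frame, bank, n_coeffs):
    spec = power_spectrum(hamming(frame))
    # Log Mel energies Y_t(m); the small floor avoids log(0)
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in bank]
    M = len(log_e)
    # DCT-II of the log energies (assumed stand-in for the IDFT step), skipping c0
    return [sum(e * math.cos(math.pi * k * (m + 0.5) / M)
                for m, e in enumerate(log_e))
            for k in range(1, n_coeffs + 1)]

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=20, n_coeffs=12):
    emphasized = pre_emphasis(signal)
    bank = mel_filterbank(n_filters, frame_len, fs)
    return [mfcc_frame(emphasized[s:s + frame_len], bank, n_coeffs)
            for s in range(0, len(emphasized) - frame_len + 1, hop)]

# Usage: a synthetic 440 Hz tone sampled at 8 kHz (not real speech)
fs = 8000
tone = [math.sin(2 * math.pi * 440 * n / fs) for n in range(1024)]
features = mfcc(tone, fs=fs)
print(len(features), len(features[0]))  # 7 frames of 12 coefficients each
```

Each row of `features` is one feature vector; a sequence of such vectors is what the later quantization stage would consume.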
Explanatory Example
[Figure panels: speech waveform of a phoneme “ae”; after pre-emphasis and Hamming windowing; power spectrum; MFCC]
Speech Signal to Feature Vector
Feature Vector to Classification
Vector Quantization (VQ)
Aim of VQ: representation of large amounts of data by a few prototype vectors.
Example:
• Identification and grouping of similar data into clusters.
• Assignment of each feature vector to the closest prototype w, using a similarity or distance measure (e.g. Euclidean distance).
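The two VQ steps above, grouping data into clusters and assigning a vector to its closest prototype, can be sketched as below. The slides do not name a training algorithm, so plain k-means is used here as an assumption. Initializing the prototypes from evenly spaced training vectors is an arbitrary choice made for reproducibility, and the 2-D data points are made up for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(vec, codebook):
    # Index of the prototype w closest to vec
    return min(range(len(codebook)), key=lambda i: euclidean(vec, codebook[i]))

def train_codebook(vectors, k, iters=20):
    # Assumed initialization: k evenly spaced training vectors
    step = len(vectors) // k
    codebook = [list(vectors[i * step]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old prototype if a cluster is empty
                codebook[i] = [sum(c) / len(members) for c in zip(*members)]
    return codebook

# Hypothetical 2-D "feature vectors" forming two obvious groups
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook = train_codebook(data, k=2)
print(codebook)                        # two prototypes, one per group
print(nearest([0.2, 0.4], codebook))   # 0
print(nearest([9.5, 10.5], codebook))  # 1
```

Six training vectors are reduced to two prototypes, which is the data reduction the slide describes.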
Database Creation Process
[Figure: speech from Speaker #1, Speaker #2 and Speaker #3 is enrolled in the database; labels include "Hello, Speaker #1" and "Hello, Speaker #2"]
Speaker Identification
[Figure: an unknown speaker (Speaker #?) is compared with database entries #1, #2 and #3; result: Speaker #1]
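One way to picture identification against a VQ database is to pick the enrolled speaker whose codebook fits the test vectors best. The sketch below assumes an average-distortion rule, which is common in VQ-based systems but is not stated in the slides. The codebooks and test vectors are hypothetical 2-D values.

```python
import math

def avg_distortion(vectors, codebook):
    # Mean Euclidean distance from each vector to its nearest prototype
    return sum(min(math.dist(v, c) for c in codebook) for v in vectors) / len(vectors)

def identify(vectors, database):
    # Speaker whose codebook yields the lowest average distortion
    return min(database, key=lambda spk: avg_distortion(vectors, database[spk]))

# Hypothetical database: one small codebook per enrolled speaker
database = {
    "Speaker #1": [[0.0, 0.0], [1.0, 1.0]],
    "Speaker #2": [[5.0, 5.0], [6.0, 6.0]],
    "Speaker #3": [[10.0, 0.0], [11.0, 1.0]],
}
unknown = [[0.2, 0.1], [0.9, 1.2]]
print(identify(unknown, database))  # Speaker #1
```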
Speaker Verification
[Figure: a claimed identity is checked against database entries #1, #2 and #3; result: Speaker #1, Accept]
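Verification differs from identification in that only the claimed speaker's model is consulted and the answer is accept or reject. A minimal sketch follows, assuming the decision is a threshold on average distortion. The threshold of 1.0 and all vectors are hypothetical, and the slides do not describe how the threshold would be chosen.

```python
import math

def avg_distortion(vectors, codebook):
    # Mean Euclidean distance from each vector to its nearest prototype
    return sum(min(math.dist(v, c) for c in codebook) for v in vectors) / len(vectors)

def verify(vectors, claimed_id, database, threshold):
    # Accept the claim only if the claimed speaker's codebook fits well enough
    return avg_distortion(vectors, database[claimed_id]) <= threshold

# Hypothetical codebooks and test utterance
database = {
    "Speaker #1": [[0.0, 0.0], [1.0, 1.0]],
    "Speaker #2": [[5.0, 5.0], [6.0, 6.0]],
}
utterance = [[0.2, 0.1], [0.9, 1.2]]
print(verify(utterance, "Speaker #1", database, threshold=1.0))  # True  (accept)
print(verify(utterance, "Speaker #2", database, threshold=1.0))  # False (reject)
```

A lower threshold rejects more impostors but also more genuine speakers, so in practice it would need tuning on held-out recordings.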
Database Creation Condition
Table 1: Database description.
Parameter             Characteristics
Language              Bangla
No. of speakers       5
Speech type           Sentence reading
Recording condition   Normal room condition
Audio length          60-90 seconds
Audio type            Stereo
Sample format         16-bit PCM
Sampling frequency    8 kHz
Bit rate              1411 kbps
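For uncompressed PCM audio, the bit rate follows from the sampling frequency, sample format and channel count. The small sketch below shows the relation. The function name is illustrative, and the 44.1 kHz case is included only for comparison because it is the rate that yields about 1411 kbps for 16-bit stereo.

```python
def pcm_bit_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    # Uncompressed PCM: bits per second = samples/s * bits/sample * channels
    return sample_rate_hz * bits_per_sample * channels / 1000

print(pcm_bit_rate_kbps(8000, 16, 2))   # 256.0
print(pcm_bit_rate_kbps(44100, 16, 2))  # 1411.2
```

The sampling frequency and bit rate listed in Table 1 may therefore refer to different stages of processing; this is an assumption, since the slides do not say.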
Speaker Recognition Result
Table 3: Test result for speaker recognition system.
Speaker      No. of inputs   Correct   Incorrect   Accuracy
Speaker_1    5               5         0           100%
Speaker_2    9               8         1           88.89%
Speaker_3    6               6         0           100%
Speaker_3    12              11        1           91.67%
Speaker_4    8               8         0           100%
Speaker_5    10              10        0           100%
Total        50              48        2           96%
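The accuracy column is the number of correct decisions divided by the number of inputs. The short sketch below recomputes the per-row and overall figures from the counts in the table. The row labels are copied as they appear there, and the variable names are illustrative.

```python
# (label, number of inputs, number correct), copied from the result table
rows = [
    ("Speaker_1", 5, 5),
    ("Speaker_2", 9, 8),
    ("Speaker_3", 6, 6),
    ("Speaker_3", 12, 11),
    ("Speaker_4", 8, 8),
    ("Speaker_5", 10, 10),
]

def accuracy(correct, total):
    # Percentage of inputs that were recognized correctly
    return 100.0 * correct / total

for name, n_inputs, n_correct in rows:
    print(f"{name}: {accuracy(n_correct, n_inputs):.2f}%")

total_inputs = sum(n for _, n, _ in rows)
total_correct = sum(c for _, _, c in rows)
overall = accuracy(total_correct, total_inputs)
print(f"Total: {total_correct}/{total_inputs} = {overall:.2f}%")  # Total: 48/50 = 96.00%
```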
Applications
• Transaction authentication
  – Toll fraud prevention
  – Telephone credit card purchases
  – Telephone brokerage (e.g., stock trading)
• Access control
  – Physical facilities
  – Computers and data networks
• Information retrieval
  – Customer information for call centers
  – Audio indexing (speech skimming device)
• Forensics
  – Voice sample matching
Conclusions
• Achieving 100% accuracy is very difficult; the proposed system achieves 96% accuracy with limited resources (speakers and utterances).
• Avoid poor-quality microphones to get better accuracy.
• Training the recognizer will provide an even better experience.
Thank You