DIGITAL VOICE ANALYSIS

Upload: vivek-gangwar
Posted on 18-Nov-2014
TRANSCRIPT

Page 1: Digital Voice Analysis

DIGITAL VOICE ANALYSIS

Prepared by

GAURAV MISHRA

BHUMIKA DWIVEDI

AKASH RAJAN RAI

KARTIC KUMAR

Page 2: Digital Voice Analysis

Table of Contents

INTRODUCTION
SPEAKER VERIFICATION
FEATURE EXTRACTION
FUTURE WORK
CONCLUSION

Page 3: Digital Voice Analysis

DIGITAL VOICE ANALYSIS - INTRODUCTION

Voice analysis is the study of speech sounds for purposes other than linguistic content, such as in speech recognition. It mostly comprises medical analysis of the voice (phoniatrics), but also speaker identification.

Speaker recognition, the process of identifying a person from a spoken phrase, allows for a secure method of authenticating speakers. Applications include voice dialing, banking over a telephone network, security control for confidential information, etc.

Challenges: a voice can be imitated to a certain degree; discriminating features must be captured; emotional and physical states affect voice quality.

Page 4: Digital Voice Analysis

SPEECH PROCESSING TAXONOMY

Recognition
    Speech Recognition
    Speaker Recognition
        Speaker Identification
            Text-dependent (closed-set)
            Text-independent (closed-set)
        Speaker Verification
            Text-dependent (closed-set)
            Text-independent (open-set)
    Language Recognition

Page 5: Digital Voice Analysis

SPEAKER VERIFICATION

• Determine whether a person is who they claim to be.

• The user makes an identity claim: a one-to-one mapping.

• The unknown voice could come from a large set of unknown speakers; this is referred to as open-set verification.

"Is this Kartic's voice?"

Page 6: Digital Voice Analysis

GENERAL THEORY OF SPEAKER VERIFICATION SYSTEM

Input speech ("My name is Mishra") with an identity claim (Mishra)
    -> Feature extraction
    -> Scored against the claimed speaker's model (Mishra's "voiceprint") and an impostor model (impostor "voiceprints")
    -> Decision: ACCEPT or REJECT

Page 7: Digital Voice Analysis

Two distinct phases to any speaker verification system:

Enrollment Phase
    Enrollment speech for each speaker (e.g. Akash, Bhumika) -> Feature extraction -> Model training -> voiceprints (models) for each speaker

Verification Phase
    Input speech with a claimed identity (e.g. Bhumika) -> Feature extraction -> Verification decision against the claimed speaker's voiceprint -> Accepted or rejected

Page 8: Digital Voice Analysis

TRAINING PHASE

The first phase of a speaker identification system (SIS) is the enrollment session, also known as the training phase. During the training phase, the SIS generates a speaker model based on the speaker's characteristics.

Front-end processing -> feature vectors -> speaker modeling -> speaker models (Speaker 1, Speaker 2, Speaker 3) stored in the speaker database

Page 9: Digital Voice Analysis

COMPONENTS OF SPEAKER IDENTIFICATION SYSTEM

There are three main components of an SI system:

Front-end Processing
Speaker Modeling
Pattern Matching and Classification

Page 10: Digital Voice Analysis

FRONT-END PROCESSING

Front-end processing generally consists of three sub-processes.

Preprocessing
    Removal of noise/silence from speech
    Frame blocking
    Windowing

Feature extraction
    Because of "the curse of dimensionality" (the number of training/test vectors needed for a classification problem grows exponentially with the dimension of the input vector), feature extraction is needed.
    It transforms the speech signal into a compact, effective representation that is more stable and discriminative than the original signal.

Page 11: Digital Voice Analysis

PRE-PROCESSING

The speech signal is a slowly time-varying signal, called quasi-stationary: when the signal is examined over a short period of time (5-100 ms), it is fairly stationary.

Speech signals are therefore analyzed in short time segments, referred to as short-time spectral analysis: typically 20-30 ms frames that overlap each other by 30-50%. This is done so that no information is lost due to the windowing.

For a sampling frequency of 11025 Hz, each frame is 23 ms long, and a new frame contains the last 11.5 ms of the previous frame's data. For a sampling frequency of 8000 Hz, each frame is 16 ms long, and a new frame contains the last 8 ms of the previous frame's data.
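The deck plans a MATLAB implementation; as an illustration only, the overlapping framing described above might be sketched in NumPy like this (the 23 ms frame length and 50% overlap are taken from the slide's 11025 Hz example, but are parameters here, not fixed choices):

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=23, overlap=0.5):
    """Split a signal into overlapping frames (frame blocking)."""
    frame_len = fs * frame_ms // 1000          # samples per frame
    hop = int(frame_len * (1 - overlap))       # step between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

fs = 11025
x = np.arange(fs)             # one second of dummy samples
frames = frame_signal(x, fs)  # ~23 ms frames, each sharing half its samples with the previous one
```

Each frame here begins halfway through the previous frame, matching the "new frame contains the last 11.5 ms of the previous frame" description.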

Page 12: Digital Voice Analysis

WINDOWING

After the signal has been framed, each individual frame is windowed so as to minimize the signal discontinuities at the beginning and end of the frame.

Each frame is multiplied by a window function w(n) of length N, where N is the length of the frame.

Typically the Hamming window is used. It preserves higher-order harmonics and avoids problems due to truncation of the signal.
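The slide does not give the window formula; the commonly used Hamming definition is w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)). A minimal sketch of windowing one frame:

```python
import numpy as np

N = 256                                           # frame length in samples (illustrative)
n = np.arange(N)
# Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.ones(N)                                # a dummy frame
windowed = frame * w                              # per-sample multiplication, as on the slide
```

The window tapers from 0.08 at both frame edges up to roughly 1 in the middle, which is what suppresses the edge discontinuities. NumPy also ships this window as `np.hamming(N)`.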

Page 13: Digital Voice Analysis

MFCC

Continuous speech -> frame blocking & windowing -> frame -> FFT -> spectrum -> Mel filter bank -> Mel-weighted spectrum -> log compression -> DCT -> Mel cepstral coefficients (feature-extracted coefficients)

Page 14: Digital Voice Analysis

Fast Fourier Transform (FFT)

The FFT converts each frame of N samples from the time domain into the frequency domain. It is a fast algorithm for the discrete Fourier transform, defined on the set of N samples {x_n} as follows:

    X_k = sum_{n=0}^{N-1} x_n e^{-j*2*pi*k*n/N},    k = 0, 1, 2, ..., N-1
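As a quick sanity check of the DFT definition above, a direct evaluation of the sum can be compared against NumPy's FFT, which computes the same transform efficiently (a sketch, not part of the deck):

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                    # one dummy frame of N samples

# Direct evaluation of X_k = sum_{n=0}^{N-1} x_n * exp(-j*2*pi*k*n/N)
n = np.arange(N)
k = n.reshape(-1, 1)
X_direct = (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

# The FFT computes exactly the same transform, in O(N log N) time
X_fft = np.fft.fft(x)
```

The direct sum costs O(N^2) operations; the FFT's O(N log N) cost is why it is used on every frame.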

Page 15: Digital Voice Analysis

Mel-frequency Wrapping

[Figure: Mel-spaced filterbank; triangular filter gains plotted against frequency (Hz), 0-7000 Hz]

Page 16: Digital Voice Analysis

Cepstrum

In this final step, the log mel spectrum is converted back to the time domain; the result is called the mel-frequency cepstrum coefficients (MFCC). Denoting the mel power spectrum coefficients that result from the last step by S_k, k = 1, 2, ..., K, the MFCCs c_n can be calculated as

    c_n = sum_{k=1}^{K} log(S_k) cos( n (k - 1/2) pi / K ),    n = 1, 2, ..., L
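The cosine-sum step above can be sketched directly in NumPy (the S_k values here are dummy stand-ins for real mel filterbank outputs, and K = 20, L = 12 are illustrative choices, not values from the deck):

```python
import numpy as np

K = 20                               # number of mel filterbank outputs
L = 12                               # number of cepstral coefficients kept
rng = np.random.default_rng(1)
S = rng.uniform(0.5, 2.0, K)         # dummy mel power spectrum coefficients S_k

# c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 1/2) * pi / K),  n = 1..L
n = np.arange(1, L + 1).reshape(-1, 1)
k = np.arange(1, K + 1)
mfcc = (np.log(S) * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)
```

This is a discrete cosine transform of the log filterbank energies: K spectral values are compressed into L decorrelated cepstral coefficients.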

Page 17: Digital Voice Analysis

PATTERN MATCHING AND CLASSIFICATION

The classifiers used for speaker identification can be grouped into two major types: template-based and stochastic-model-based classifiers.

Template-based classifiers are considered to be the simplest:
    Dynamic Time Warping (useful for text-dependent speaker recognition)
    Vector Quantization (useful for text-independent speaker recognition)

Stochastic models provide more flexibility and better results:
    Gaussian Mixture Models (useful for text-independent speaker recognition)
    Hidden Markov Models (useful for text-dependent speaker recognition)
    Neural networks to model a speaker's acoustic space

Page 18: Digital Voice Analysis

SPEAKER MODELING-VQ

Vector Quantization: it is not possible to use all of a given speaker's feature vectors from the training data to form the speaker's model, because there are too many feature vectors per speaker. A method of reducing/compressing the number of training vectors is therefore required: a codebook is formed, consisting of a small number of highly representative vectors that efficiently represent the speaker-specific characteristics.

VQ is the process of mapping feature vectors in a vector space into a finite number of regions of that space. Each region is called a cluster, and each cluster is represented by its centroid. The collection of all centroids is called the codebook.
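Codebooks are often trained with the LBG algorithm; as a simpler stand-in (an assumption, not necessarily the deck's method), plain k-means clustering over dummy 2-D "feature vectors" illustrates the cluster-and-keep-centroids idea:

```python
import numpy as np

def vq_codebook(features, n_codewords=4, iters=20):
    """Build a VQ codebook: cluster the feature vectors and keep the centroids."""
    # Initialize codewords from points spread across the training data
    idx = np.linspace(0, len(features) - 1, n_codewords).astype(int)
    codebook = features[idx].astype(float).copy()
    for _ in range(iters):
        # Assign each feature vector to its nearest codeword (its cluster)
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the centroid of its cluster
        for c in range(n_codewords):
            members = features[labels == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(2)
# Dummy "feature vectors": two well-separated blobs standing in for MFCC frames
feats = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
codebook = vq_codebook(feats, n_codewords=2)
```

At identification time, a test utterance would be scored against each speaker's codebook by summing the distances from its feature vectors to their nearest codewords; the speaker with the smallest total distortion wins.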

Page 19: Digital Voice Analysis

FUTURE WORK

We will develop selected algorithms related to speaker identification in MATLAB. The implementation will be modular and will be done keeping real-time operation in view; a complete real-time implementation, though, is not in the scope of the project.

We will test and verify the performance of all the algorithms. For this purpose the collected data will be divided into training and testing sets (70% training, 30% testing).

For the hardware implementation we will use either interfacing or a digital signal processor.

Page 20: Digital Voice Analysis

CONCLUSION

Speaker verification is one of the few recognition areas where machines can outperform humans.

Speaker verification technology is a viable technique currently available for applications.

Speaker verification can be augmented with other authentication techniques to add further security.