7- Speech Recognition


Page 1: 7- Speech Recognition

7- Speech Recognition

Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes Rule
Simple Language Model
P(A|W) Network Types

Page 2: 7- Speech Recognition

7- Speech Recognition (Cont'd)

HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training In HMM

Page 3: 7- Speech Recognition

Recognition Tasks

Isolated Word Recognition (IWR)
Connected Word (CW) and Continuous Speech Recognition (CSR)
Speaker Dependent, Multiple Speaker, and Speaker Independent
Vocabulary Size
– Small: <20
– Medium: >100, <1000
– Large: >1000, <10000
– Very Large: >10000

Page 4: 7- Speech Recognition

Speech Recognition Concepts

(Diagram: NLP and Speech Processing blocks. Speech Synthesis maps text, via a phone sequence, to speech; Speech Recognition maps speech back to text; Speech Understanding combines Speech Processing with NLP.)

Speech recognition is the inverse of speech synthesis.

Page 5: 7- Speech Recognition

Speech Recognition Approaches

Bottom-Up Approach
Top-Down Approach
Blackboard Approach

Page 6: 7- Speech Recognition

Bottom-Up Approach

(Diagram: the signal passes through Signal Processing, Feature Extraction, and Segmentation stages, then through Sound Classification Rules, Phonotactic Rules, Lexical Access, and a Language Model to produce the Recognized Utterance. Knowledge sources such as Voiced/Unvoiced/Silence decisions feed the segmentation stages.)

Page 7: 7- Speech Recognition

Top-Down Approach

(Diagram: Feature Analysis feeds a Unit Matching System, followed by Lexical, Syntactic, and Semantic Hypothesis stages and an Utterance Verifier/Matcher that outputs the Recognized Utterance. Knowledge sources: an inventory of speech recognition units, a word dictionary, a grammar, and a task model.)

Page 8: 7- Speech Recognition

Blackboard Approach

(Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all read from and write to a shared Blackboard.)

Page 9: 7- Speech Recognition

Recognition Theories

Articulatory-Based Recognition
– Uses the articulatory system for recognition
– So far the most successful theory

Auditory-Based Recognition
– Uses the auditory system for recognition

Hybrid-Based Recognition
– A hybrid of the above theories

Motor Theory
– Models the intended gestures of the speaker

Page 10: 7- Speech Recognition

Recognition Problem

We have a sequence of acoustic symbols and want to find the words expressed by the speaker.

Solution: find the most probable word sequence given the acoustic symbols.

Page 11: 7- Speech Recognition

Recognition Problem

A: acoustic symbols
W: word sequence

We should find $\hat{W}$ such that $P(\hat{W}|A) = \max_W P(W|A)$.

Page 12: 7- Speech Recognition

Bayes Rule

$P(x|y)P(y) = P(x,y)$

$P(x|y) = \dfrac{P(y|x)P(x)}{P(y)}$

$P(W|A) = \dfrac{P(A|W)P(W)}{P(A)}$

Page 13: 7- Speech Recognition

Bayes Rule (Cont'd)

$P(\hat{W}|A) = \max_W P(W|A) = \max_W \dfrac{P(A|W)P(W)}{P(A)}$

$\hat{W} = \arg\max_W P(W|A) = \arg\max_W P(A|W)P(W)$

(Since $P(A)$ does not depend on $W$, it can be dropped from the maximization.)

Page 14: 7- Speech Recognition

Simple Language Model

$W = w_1 w_2 w_3 \cdots w_n$

$P(W) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1,w_2)\cdots P(w_n|w_1,w_2,\ldots,w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1,\ldots,w_{i-1})$

Computing this probability is very difficult and requires a very large database, so trigram and bigram models are used instead.

Page 15: 7- Speech Recognition

Simple Language Model (Cont'd)

Trigram: $P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$

Bigram: $P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$

Unigram (monogram): $P(w_1^n) = \prod_{i=1}^{n} P(w_i)$
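A bigram model of this kind can be estimated by relative frequency counting; a minimal pure-Python sketch (the tiny corpus is an invented illustration):

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency, with <s> as start symbol."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split()
        unigrams.update(words[:-1])          # count each history word
        bigrams.update(zip(words[:-1], words[1:]))  # count adjacent pairs
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = ["we recognize speech", "we recognize words"]  # toy corpus
p = train_bigram(corpus)
print(p("we", "recognize"))      # 2/2 = 1.0
print(p("recognize", "speech"))  # 1/2 = 0.5
```

A real system would also smooth these estimates, since any unseen pair gets probability zero here.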

Page 16: 7- Speech Recognition

Simple Language Model (Cont'd)

$P(w_3 \mid w_1 w_2)$

Computing method:

$P(w_3 \mid w_1 w_2) = \dfrac{\text{number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{total number of occurrences of } w_1 w_2}$

Ad hoc (interpolation) method:

$P(w_3 \mid w_1 w_2) = a_1 f(w_3 \mid w_1 w_2) + a_2 f(w_3 \mid w_2) + a_3 f(w_3)$

where $f$ denotes a relative frequency and the $a_i$ are interpolation weights.
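The interpolated estimate can be computed directly; the relative frequencies and the weights below are invented toy values (the slide does not specify the weights, so weights summing to 1 are assumed):

```python
def interpolated_trigram(f_tri, f_bi, f_uni, a=(0.6, 0.3, 0.1)):
    """Smooth a trigram estimate by mixing trigram, bigram, and unigram
    relative frequencies with interpolation weights that sum to 1."""
    a1, a2, a3 = a
    assert abs(a1 + a2 + a3 - 1.0) < 1e-9  # weights must form a convex combination
    return a1 * f_tri + a2 * f_bi + a3 * f_uni

# Toy case: the trigram w1 w2 w3 was never seen (f=0.0), but the bigram and
# unigram frequencies keep the smoothed estimate nonzero.
print(interpolated_trigram(0.0, 0.2, 0.05))  # ≈ 0.065
```

This is why interpolation helps: an unseen trigram no longer forces the whole sentence probability to zero.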

Page 17: 7- Speech Recognition

Error Production Factors

Prosody (recognition should be prosody-independent)
Noise (noise should be prevented)
Spontaneous Speech

Page 18: 7- Speech Recognition

P(A|W) Computing Approaches

Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems

Page 19: 7- Speech Recognition

Dynamic Time Warping

Page 20: 7- Speech Recognition

Dynamic Time Warping

Page 21: 7- Speech Recognition

Dynamic Time Warping

Page 22: 7- Speech Recognition

Dynamic Time Warping

Page 23: 7- Speech Recognition

Dynamic Time Warping

Search Limitations:
– First & End Interval
– Global Limitation
– Local Limitation

Page 24: 7- Speech Recognition

Dynamic Time Warping

Global Limitation:

Page 25: 7- Speech Recognition

Dynamic Time Warping

Local Limitation:
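The DTW alignment these slides illustrate can be sketched in a few lines; this is a minimal dynamic-programming version without the global and local path constraints discussed above:

```python
def dtw_distance(x, y):
    """Classic DTW: cost of the best monotonic alignment between sequences
    x and y, using absolute difference as the local distance."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0  # both sequences start aligned (first-interval endpoint condition)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Local continuity: each cell is reached by a diagonal, vertical,
            # or horizontal step (this is where local limitations would go).
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]  # end-interval condition: both sequences finish together

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the extra 2 aligns at no cost
```

Global limitations (e.g. a Sakoe-Chiba band) would simply restrict which (i, j) cells the loops visit.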

Page 26: 7- Speech Recognition

Artificial Neural Network

(Diagram: inputs $x_0, x_1, \ldots, x_{N-1}$ with weights $w_0, w_1, \ldots, w_{N-1}$ feed a summing node with activation $\varphi$.)

$y = \varphi\!\left(\sum_{i=0}^{N-1} w_i x_i\right)$

Simple computational element of a neural network.
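This computational element is a one-liner in code; the logistic activation and the sample inputs and weights are assumptions for illustration, since the slide leaves $\varphi$ generic:

```python
import math

def neuron(x, w, phi=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """Single computational element: y = phi(sum_i w_i * x_i).
    A logistic (sigmoid) activation is assumed; the slide leaves phi generic."""
    s = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum of the inputs
    return phi(s)

print(neuron([1.0, 0.5], [0.4, -0.2]))  # phi(0.3) ≈ 0.574
```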

Page 27: 7- Speech Recognition

Artificial Neural Network (Cont'd)

Neural Network Types
– Perceptron
– Time Delay
– Time Delay Neural Network (TDNN) Computational Element

Page 28: 7- Speech Recognition

Artificial Neural Network (Cont'd)

Single Layer Perceptron (diagram: inputs $x_0, \ldots, x_{N-1}$ fully connected to outputs $y_0, \ldots, y_{M-1}$)

Page 29: 7- Speech Recognition

Artificial Neural Network (Cont'd)

Three Layer Perceptron (diagram)

Page 30: 7- Speech Recognition

2.5.4.2 Neural Network Topologies

Page 31: 7- Speech Recognition

TDNN

Page 32: 7- Speech Recognition

2.5.4.6 Neural Network Structures for Speech Recognition

Page 33: 7- Speech Recognition

2.5.4.6 Neural Network Structures for Speech Recognition

Page 34: 7- Speech Recognition

Hybrid Methods

Hybrid Neural Network and Matched Filter for Recognition

(Diagram: speech acoustic features pass through delays into a pattern classifier with output units.)

Page 35: 7- Speech Recognition

Neural Network Properties

The system is simple, but training requires many iterations
Does not impose a specific structure
Despite its simplicity, the results are good
The training set is large, so training should be offline
Accuracy is relatively good

Page 36: 7- Speech Recognition

Pre-processing

Different preprocessing techniques are employed as the front end of speech recognition systems.

The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.

Page 37: 7- Speech Recognition


Page 38: 7- Speech Recognition


Page 39: 7- Speech Recognition


Page 40: 7- Speech Recognition


Page 41: 7- Speech Recognition

The MFCC Method

The MFCC method is based on how the human ear perceives sounds.

MFCC performs better than other features in noisy environments.

MFCC was proposed mainly for speech recognition applications, but it also performs well in speaker recognition.

The mel, the unit of human hearing, is obtained from the following relation (the formula is not recoverable from the extracted text):

Page 42: 7- Speech Recognition

Steps of the MFCC Method

Step 1: map the signal from the time domain to the frequency domain using the short-time FFT.

Z(n): the speech signal
W(n): a window function, e.g. the Hamming window
$W_F = e^{-j2\pi/F}$
m: 0, …, F − 1
F: the length of the speech frame

Page 43: 7- Speech Recognition

Steps of the MFCC Method

Step 2: find the energy of each channel of the filter bank.

M is the number of mel-scale filter banks.

$W_k(j)$, $k = 0, 1, \ldots, M-1$, are the filter functions of the filter bank.

Page 44: 7- Speech Recognition

Mel-Scale Filter Distribution (figure)

Page 45: 7- Speech Recognition

Steps of the MFCC Method

Step 4: compress the spectrum and apply the DCT to obtain the MFCC coefficients.

In the relation above, $n = 0, \ldots, L$ is the order of the MFCC coefficients.

Page 46: 7- Speech Recognition

The Mel-Cepstrum Method

(Block diagram: the time-domain signal is framed, passed through $|FFT|^2$, mel-scaling, a logarithm, and an IDCT; the low-order coefficients give the cepstra, and a differentiator produces the delta and delta-delta cepstra.)
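The mel-scaling step of this pipeline can be sketched in pure Python; since the slides' own mel formula was lost in extraction, the commonly used 2595·log10(1 + f/700) relation is assumed here:

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to mel, using the common 2595*log10(1 + f/700) form
    (assumed here; the slide's own formula is not in the extracted text)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place filter centers evenly on the mel scale."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centers of a toy 5-filter mel bank between 0 Hz and 4000 Hz: equal spacing
# in mel becomes progressively wider spacing in Hz.
lo, hi, n = hz_to_mel(0.0), hz_to_mel(4000.0), 5
centers = [mel_to_hz(lo + (hi - lo) * (i + 1) / (n + 1)) for i in range(n)]
print([round(c) for c in centers])
```

The widening spacing is the whole point of the mel scale: it mimics the ear's coarser frequency resolution at high frequencies.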

Page 47: 7- Speech Recognition

Mel-Cepstrum Coefficients (MFCC) (figure)

Page 48: 7- Speech Recognition

Properties of the Mel-Cepstrum (MFCC)

Maps the mel filter-bank energies in the direction of maximum variance (using the DCT)
Makes the speech features approximately independent of one another (an effect of the DCT)
Good performance in clean environments
Reduced performance in noisy environments

Page 49: 7- Speech Recognition

Time-Frequency Analysis

Short-Term Fourier Transform
– The standard way of frequency analysis: decompose the incoming signal into its constituent frequency components.
– W(n): windowing function
– N: frame length
– p: step size
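The short-term Fourier transform with the parameters above (window W(n), frame length N, step size p) can be sketched in pure Python; a rectangular window and a naive DFT are assumed for brevity:

```python
import cmath

def stft(signal, N=4, p=2, window=None):
    """Short-term Fourier transform: slide a length-N window over the signal
    in steps of p and take a DFT of each frame.
    window: optional list of N coefficients W(n); rectangular if None."""
    w = window or [1.0] * N
    frames = []
    for start in range(0, len(signal) - N + 1, p):
        frame = [signal[start + n] * w[n] for n in range(N)]
        # Naive DFT of one frame: X[k] = sum_n x[n] * e^{-j 2 pi k n / N}
        spectrum = [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                        for n in range(N)) for k in range(N)]
        frames.append(spectrum)
    return frames

frames = stft([1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0])
print(len(frames))  # 3 frames: window starts at samples 0, 2, 4
```

A production front end would use an FFT and a tapered window (e.g. Hamming) instead of the naive DFT and rectangular window used here.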

Page 50: 7- Speech Recognition

Critical Band Integration

Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.

Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.

Page 51: 7- Speech Recognition

Bark Scale (figure)

Page 52: 7- Speech Recognition

Feature Orthogonalization

Spectral values in adjacent frequency channels are highly correlated.

The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.

Decorrelation is useful to improve the parameter estimation.