7 - Speech Recognition

Post on 21-Mar-2016


7 - Speech Recognition

- Speech Recognition Concepts
- Speech Recognition Approaches
- Recognition Theories
- Bayes Rule
- Simple Language Model
- P(A|W) Network Types

7 - Speech Recognition (Cont'd)

- HMM Calculating Approaches
- Neural Components
- Three Basic HMM Problems
- Viterbi Algorithm
- State Duration Modeling
- Training in HMM

Recognition Tasks

- Isolated Word Recognition (IWR)
- Connected Word (CW) and Continuous Speech Recognition (CSR)
- Speaker Dependent, Multiple Speaker, and Speaker Independent
- Vocabulary size:
  - Small: < 20
  - Medium: > 100, < 1,000
  - Large: > 1,000, < 10,000
  - Very large: > 10,000

Speech Recognition Concepts

[Diagram: two directions of processing. Speech synthesis: text (or phone sequence) -> NLP -> speech processing -> speech. Speech understanding: speech -> speech processing -> NLP -> text, with speech recognition as the speech-to-text step.]

Speech recognition is the inverse of speech synthesis.

Speech Recognition Approaches

- Bottom-Up Approach
- Top-Down Approach
- Blackboard Approach

Bottom-Up Approach

[Diagram: processing chain Signal Processing -> Feature Extraction -> Segmentation -> ... -> Recognized Utterance, with knowledge sources (Voiced/Unvoiced/Silence decisions, Sound Classification Rules, Phonotactic Rules, Lexical Access, Language Model) applied at successive stages.]

Top-Down Approach

[Diagram: Feature Analysis -> Unit Matching System -> Lexical Hypothesis -> Syntactic Hypothesis -> Semantic Hypothesis -> Utterance Verifier/Matcher -> Recognized Utterance, driven by an inventory of speech recognition units, a word dictionary, a grammar, and a task model.]

Blackboard Approach

[Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all reading from and writing to a shared Blackboard.]

Recognition Theories

- Articulatory-Based Recognition: uses the articulatory system for recognition; so far the most successful theory.
- Auditory-Based Recognition: uses the auditory system for recognition.
- Hybrid-Based Recognition: a hybrid of the above theories.
- Motor Theory: models the intended gestures of the speaker.

Recognition Problem

We have a sequence of acoustic symbols and want to find the words expressed by the speaker.

Solution: find the most probable word sequence given the acoustic symbols.

Recognition Problem

A: acoustic symbols
W: word sequence

We should find W^ such that P(W^|A) = max_W P(W|A)

Bayes Rule

P(x|y) P(y) = P(x, y)

P(x|y) = P(y|x) P(x) / P(y)

P(W|A) = P(A|W) P(W) / P(A)

Bayes Rule (Cont'd)

P(W^|A) = max_W P(A|W) P(W) / P(A)

W^ = argmax_W P(W|A) = argmax_W P(A|W) P(W)

Since P(A) does not depend on W, it can be dropped from the maximization.
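The decision rule can be sketched in a few lines of Python. The candidate sentences and all probabilities below are invented for illustration; a real recognizer works with acoustic and language model scores in the log domain, exactly as here:

```python
import math

# Hypothetical log-domain scores for two candidate word sequences;
# the numbers are invented for illustration, not from a trained system.
log_p_acoustic = {"recognize speech": math.log(1e-4),
                  "wreck a nice beach": math.log(8e-5)}
log_p_language = {"recognize speech": math.log(1e-3),
                  "wreck a nice beach": math.log(1e-5)}

def decode(candidates):
    """W^ = argmax_W P(A|W) P(W); P(A) is constant over W and dropped.
    Log probabilities turn the product into a sum and avoid underflow."""
    return max(candidates,
               key=lambda w: log_p_acoustic[w] + log_p_language[w])

best = decode(["recognize speech", "wreck a nice beach"])
```

Here the language model dominates: both hypotheses fit the acoustics about equally well, but "recognize speech" is far more probable a priori.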

Simple Language Model

W = w1 w2 w3 ... wn

P(W) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, w2, ..., wn-1)
     = prod_{i=1..n} P(wi | w1, ..., wi-1)

Computing this probability is very difficult and needs a very large database, so trigram and bigram models are used instead.

Simple Language Model (Cont'd)

Trigram:  P(W) = prod_{i=1..n} P(wi | wi-1, wi-2)

Bigram:   P(W) = prod_{i=1..n} P(wi | wi-1)

Monogram: P(W) = prod_{i=1..n} P(wi)
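As a sketch of how the bigram factorization is applied, the toy model below (all probabilities are made up) scores a sentence padded with start and end symbols as a product of P(wi | wi-1):

```python
# Toy bigram model; the probabilities are invented for illustration.
bigram = {("<s>", "i"): 0.5, ("i", "like"): 0.4,
          ("like", "speech"): 0.2, ("speech", "</s>"): 0.3}

def sentence_prob(words, model):
    """P(W) ~= product over i of P(w_i | w_{i-1}), padding with <s>, </s>."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= model.get((prev, cur), 0.0)  # unseen bigram -> probability 0
    return p

p = sentence_prob(["i", "like", "speech"], bigram)  # 0.5*0.4*0.2*0.3 = 0.012
```

An unseen bigram zeroes out the whole product, which is exactly why smoothing or interpolation (next slide) is needed in practice.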

Simple Language Model (Cont'd)

Computing method:

P(w3 | w1, w2) = (number of occurrences of w3 after w1 w2) / (total number of occurrences of w1 w2)

Ad hoc method (interpolation):

P(w3 | w1, w2) = lambda1 f(w3 | w1, w2) + lambda2 f(w3 | w2) + lambda3 f(w3)
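The counting method and the interpolation can be sketched as follows. The corpus and the lambda weights are illustrative only; in practice the weights are tuned on held-out data:

```python
from collections import Counter

corpus = "a b c a b c a b d".split()   # toy corpus, for illustration only

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def f_uni(w3):
    return uni[w3] / len(corpus)

def f_bi(w2, w3):
    return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

def f_tri(w1, w2, w3):
    # count of w3 after w1 w2, divided by the count of w1 w2
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

def p_interp(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated trigram estimate. The lambda weights are assumed here;
    in practice they are tuned on held-out data."""
    l3, l2, l1 = lambdas
    return l3 * f_tri(w1, w2, w3) + l2 * f_bi(w2, w3) + l1 * f_uni(w3)
```

Backing off to the bigram and unigram frequencies keeps the estimate nonzero even when a particular trigram never occurred in the training data.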

Error Production Factors

- Prosody (recognition should be prosody-independent)
- Noise (noise should be prevented)
- Spontaneous speech

P(A|W) Computing Approaches

- Dynamic Time Warping (DTW)
- Hidden Markov Model (HMM)
- Artificial Neural Network (ANN)
- Hybrid Systems

Dynamic Time Warping

[Figures: DTW alignment of two feature sequences, spread over several slides.]

Search limitations:
- Start and end intervals
- Global limitations
- Local limitations

Global limitation: [figure]

Local limitation: [figure]
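A minimal DTW sketch with the basic local constraint (each step is a match, an insertion, or a deletion); the start and end points are anchored at the corners, and global path constraints are omitted for brevity:

```python
import math

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Cumulative DTW distance between sequences x and y."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0                       # start-point anchor
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1],   # match
                                 D[i - 1][j],       # insertion
                                 D[i][j - 1])       # deletion
    return D[n][m]                      # end-point anchor

d = dtw([1, 2, 3], [1, 2, 2, 3])  # the repeated 2 is absorbed at zero cost
```

A global constraint (e.g., a Sakoe-Chiba band) would simply restrict the inner loop to |i - j| below some width.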

Artificial Neural Network

[Diagram: inputs x0, x1, ..., xN-1 weighted by w0, w1, ..., wN-1, summed and passed through an activation phi.]

y = phi( sum_{i=0..N-1} wi xi )

Simple computational element of a neural network.
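The computational element above is a weighted sum passed through an activation phi. A sketch, assuming a sigmoid for phi since the slide leaves it unspecified:

```python
import math

def neuron(x, w, phi=lambda v: 1.0 / (1.0 + math.exp(-v))):
    """y = phi( sum_{i=0}^{N-1} w_i * x_i ).
    The sigmoid activation is an assumption; the slide leaves phi open."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)))

y = neuron([1.0, 1.0], [0.0, 0.0])  # zero net input -> sigmoid gives 0.5
```

Passing a different phi (e.g., the identity) turns the same element into a plain linear unit.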

Artificial Neural Network (Cont'd)

Neural network types:
- Perceptron
- Time Delay
- Time Delay Neural Network (TDNN) computational element

Artificial Neural Network (Cont'd)

[Diagram: single-layer perceptron with inputs x0 ... xN-1 and outputs y0 ... yM-1.]

Artificial Neural Network (Cont'd)

[Diagram: three-layer perceptron.]

2.5.4.2 Neural Network Topologies

TDNN

2.5.4.6 Neural Network Structures for Speech Recognition

2.5.4.6 Neural Network Structures for Speech Recognition (Cont'd)

Hybrid Methods

- Hybrid neural network and matched filter for recognition

[Diagram: speech -> acoustic features -> delays -> pattern classifier -> output units.]

Neural Network Properties

- The system is simple, but training requires many iterations
- Does not impose a specific structure
- Despite its simplicity, the results are good
- The training set is large, so training should be offline
- Accuracy is relatively good

Pre-processing

- Different preprocessing techniques are employed as the front end of speech recognition systems.
- The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.


The MFCC Method

- The MFCC method is based on how the human ear perceives sounds.
- MFCC performs better than other features in noisy environments.
- MFCC was introduced mainly for speech recognition applications, but it does not give adequate performance for speaker recognition.
- The mel, the perceptual unit of human hearing, is obtained from the frequency by the relation mel(f) = 2595 log10(1 + f / 700).

Steps of the MFCC Method

Step 1: map the signal from the time domain to the frequency domain using the short-time FFT.

- Z(n): speech signal
- W(n): window function (e.g., a Hamming window)
- W_F = e^(-j2*pi/F)
- m = 0, ..., F - 1
- F: length of the speech frame

Steps of the MFCC Method (Cont'd)

Step 2: find the energy of each filter-bank channel.

- M: the number of mel-scale filter banks
- W_k(j), k = 0, 1, ..., M - 1: the filters of the filter bank

[Figure: distribution of the mel-scale filters.]

Steps of the MFCC Method (Cont'd)

Step 4: compress the spectrum and apply the DCT to obtain the MFCC coefficients.

- In the relation above, n = 0, ..., L, where L is the order of the MFCC coefficients.

The Mel-Cepstrum Method

[Block diagram: time framing of the signal -> |FFT|^2 -> Mel-scaling -> Logarithm -> IDCT -> low-order coefficients (cepstra) -> differentiator -> delta and delta-delta cepstra.]
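The block diagram can be sketched end to end for a single frame. The filter-bank construction here is deliberately simplified (plain triangular filters on a mel-spaced grid) and the parameter values are illustrative, not taken from the slides:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=20, n_coeffs=12):
    """One-frame MFCC following the block diagram:
    |FFT|^2 -> mel filter bank -> logarithm -> DCT (low-order coefficients).
    Parameters and filter design are illustrative assumptions."""
    power = np.abs(np.fft.rfft(frame)) ** 2               # |FFT|^2 stage
    n_bins = power.size
    # Mel scale: mel(f) = 2595 * log10(1 + f / 700)
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor(hz_pts / (sample_rate / 2) * (n_bins - 1)).astype(int)
    energies = np.zeros(n_filters)
    for k in range(n_filters):                            # filter-bank energies
        lo, mid, hi = bins[k], bins[k + 1], bins[k + 2]
        tri = np.concatenate([np.linspace(0, 1, max(mid - lo, 1), endpoint=False),
                              np.linspace(1, 0, max(hi - mid, 1), endpoint=False)])
        seg = power[lo:lo + tri.size]
        energies[k] = np.dot(tri[:seg.size], seg)
    log_e = np.log(energies + 1e-10)                      # logarithm stage
    # DCT-II, keeping only the low-order coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e

frame = np.sin(2 * np.pi * 1000 * np.arange(512) / 16000)  # 1 kHz test tone
coeffs = mfcc_frame(frame)
```

The delta and delta-delta cepstra from the diagram would then be computed as frame-to-frame differences of these coefficients.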

Mel-Cepstrum (MFCC) Coefficients

Properties of Mel-Cepstrum (MFCC) Features

- Maps the mel filter-bank energies onto the directions in which their variance is maximal (using the DCT).
- Makes the speech features partially independent of one another (an effect of the DCT).
- Good performance in clean environments.
- Reduced performance in noisy environments.

Time-Frequency Analysis

Short-Term Fourier Transform

- The standard way of frequency analysis: decompose the incoming signal into its constituent frequency components.
- W(n): windowing function
- N: frame length
- p: step size
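Using the slide's notation (window W(n), frame length N, step size p), a minimal STFT sketch; the Hamming window and the test tone are assumptions for the demo:

```python
import numpy as np

def stft(x, N=256, p=128):
    """Short-term Fourier transform: split x into frames of length N every
    p samples, apply the window W(n) (Hamming here), and FFT each frame."""
    w = np.hamming(N)
    frames = [x[i:i + N] * w for i in range(0, len(x) - N + 1, p)]
    return np.array([np.fft.rfft(f) for f in frames])

x = np.sin(2 * np.pi * 1000 * np.arange(2048) / 16000)  # 1 kHz tone, 16 kHz rate
S = stft(x)  # energy concentrates in bin 1000 / (16000 / 256) = 16
```

Each row of S is the spectrum of one frame, so |S| is the familiar spectrogram.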

Critical Band Integration

- Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
- Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.

Bark Scale

Feature Orthogonalization

- Spectral values in adjacent frequency channels are highly correlated.
- The correlation leads to a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
- Decorrelation is useful to improve parameter estimation.
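A PCA-style decorrelation sketch on simulated correlated channels (the data is synthetic, invented for the demo): projecting onto the eigenvectors of the sample covariance yields a near-diagonal covariance, so a diagonal Gaussian model needs far fewer parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "filter-bank" channels: one shared component plus small noise,
# mimicking the strong correlation between adjacent frequency channels.
shared = rng.normal(size=(1000, 1))
channels = shared + 0.1 * rng.normal(size=(1000, 8))

def decorrelate(X):
    """Project onto the eigenvectors of the sample covariance (PCA),
    which diagonalizes the covariance of the transformed features."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    _, vecs = np.linalg.eigh(cov)
    return Xc @ vecs

Y = decorrelate(channels)
cov_y = np.cov(Y, rowvar=False)
off_diag = cov_y - np.diag(np.diag(cov_y))  # ~0 after decorrelation
```

The DCT used in MFCC plays the same role approximately, without estimating a covariance from data.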
