7 - Speech Recognition
- Speech Recognition Concepts
- Speech Recognition Approaches
- Recognition Theories
- Bayes Rule
- Simple Language Model
- P(A|W) Computing Approaches
7 - Speech Recognition (Cont'd)
- HMM Calculating Approaches
- Neural Components
- Three Basic HMM Problems
- Viterbi Algorithm
- State Duration Modeling
- Training in HMM
Recognition Tasks
- Isolated Word Recognition (IWR)
- Connected Word (CW) and Continuous Speech Recognition (CSR)
- Speaker Dependent, Multiple Speaker, and Speaker Independent
- Vocabulary Size
  - Small: < 20
  - Medium: > 100, < 1000
  - Large: > 1000, < 10000
  - Very Large: > 10000
Speech Recognition Concepts
[Diagram: NLP and Speech Processing. Speech Synthesis maps text (a phone sequence) to speech; Speech Recognition maps speech back to text; combined with NLP this gives Speech Understanding.]
Speech recognition is the inverse of speech synthesis.
Speech Recognition Approaches
- Bottom-Up Approach
- Top-Down Approach
- Blackboard Approach
Bottom-Up Approach
[Diagram: Signal Processing → Feature Extraction (e.g., Voiced/Unvoiced/Silence) → Segmentation → Sound Classification Rules → Phonotactic Rules → Lexical Access → Language Model → Recognized Utterance, with knowledge sources attached to each stage.]
Top-Down Approach
[Diagram: Feature Analysis → Unit Matching System → Lexical Hypothesis → Syntactic Hypothesis → Semantic Hypothesis → Utterance Verifier/Matcher → Recognized Utterance, driven by an inventory of speech recognition units, a word dictionary, a grammar, and a task model.]
Blackboard Approach
[Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all reading from and writing to a shared Blackboard.]
Recognition Theories
- Articulatory-Based Recognition: uses the articulatory system for recognition; so far the most successful theory.
- Auditory-Based Recognition: uses the auditory system for recognition.
- Hybrid-Based Recognition: a hybrid of the above theories.
- Motor Theory: models the intended gesture of the speaker.
Recognition Problem
- We have a sequence of acoustic symbols and want to find the word sequence expressed by the speaker.
- Solution: find the most probable word sequence given the acoustic symbols.
Recognition Problem
- A : acoustic symbols
- W : word sequence

We should find Ŵ such that

P(Ŵ | A) = max_W P(W | A)
Bayes Rule

P(x | y) P(y) = P(x, y)

P(x | y) = P(y | x) P(x) / P(y)

P(W | A) = P(A | W) P(W) / P(A)
Bayes Rule (Cont'd)

P(Ŵ | A) = max_W P(W | A) = max_W [ P(A | W) P(W) / P(A) ]

Since P(A) does not depend on W:

Ŵ = argmax_W P(W | A) = argmax_W P(A | W) P(W)
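The decision rule can be illustrated with a toy decoder. All candidate transcriptions and probability values below are invented for the example:

```python
def decode(p_a_given_w, p_w):
    """Return the word sequence W maximizing P(A|W) * P(W)."""
    return max(p_a_given_w, key=lambda w: p_a_given_w[w] * p_w[w])

# Hypothetical scores for two competing transcriptions of one utterance:
p_a_given_w = {"recognize speech": 0.0021,        # acoustic likelihoods P(A|W)
               "wreck a nice beach": 0.0025}
p_w = {"recognize speech": 0.012,                 # language-model priors P(W)
       "wreck a nice beach": 0.0004}

best = decode(p_a_given_w, p_w)
print(best)  # recognize speech
```

Here the language-model prior overrides the slightly better acoustic match, which is exactly why the P(W) term matters in the product.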
Simple Language Model

W = w1 w2 w3 ... wn

P(W) = P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w1, w2, w3) ... P(wn | w1, ..., wn-1)
     = prod_{i=1..n} P(wi | w1, ..., wi-1)
Computing this probability is very difficult and requires a very large database, so trigram and bigram models are used instead.
Simple Language Model (Cont'd)

Trigram:  P(W) = prod_{i=1..n} P(wi | wi-1, wi-2)

Bigram:   P(W) = prod_{i=1..n} P(wi | wi-1)

Monogram: P(W) = prod_{i=1..n} P(wi)
Simple Language Model (Cont'd)

Computing method:

P(w3 | w1, w2) = (number of occurrences of w3 after w1 w2) / (total number of occurrences of w1 w2)

Ad hoc method (interpolation with weights λ1, λ2, λ3):

P(w3 | w1, w2) = λ1 f(w3 | w1, w2) + λ2 f(w3 | w2) + λ3 f(w3)
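Both estimates can be sketched in a few lines. The toy corpus and the interpolation weights (called `lam` here) are arbitrary illustration values:

```python
from collections import Counter

def train_counts(tokens):
    """Relative-frequency estimates f(w3|w1,w2), f(w3|w2), f(w3)
    from raw n-gram counts."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n = len(tokens)
    f_tri = lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    f_bi = lambda w2, w3: bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f_uni = lambda w3: uni[w3] / n
    return f_tri, f_bi, f_uni

def p_interp(f_tri, f_bi, f_uni, w1, w2, w3, lam=(0.6, 0.3, 0.1)):
    """Ad hoc estimate: lam1*f(w3|w1,w2) + lam2*f(w3|w2) + lam3*f(w3)."""
    return lam[0] * f_tri(w1, w2, w3) + lam[1] * f_bi(w2, w3) + lam[2] * f_uni(w3)

f_tri, f_bi, f_uni = train_counts("the cat sat on the mat the cat ran".split())
p = p_interp(f_tri, f_bi, f_uni, "the", "cat", "sat")
```

The interpolation lets the bigram and monogram terms back off the trigram estimate when the trigram count is zero or unreliable.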
Error Production Factors
- Prosody (recognition should be prosody-independent)
- Noise (noise should be prevented)
- Spontaneous speech
P(A|W) Computing Approaches
- Dynamic Time Warping (DTW)
- Hidden Markov Model (HMM)
- Artificial Neural Network (ANN)
- Hybrid Systems
Dynamic Time Warping
[Figures: DTW alignment of two utterances of different lengths]

Search limitations:
- First & end interval
- Global limitation
- Local limitation
Dynamic Time Warping
- Global limitation: [figure]
- Local limitation: [figure]
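A minimal sketch of the DTW recursion with the basic symmetric local constraint; the global limitation (e.g., a search band) and endpoint relaxation are omitted for brevity:

```python
def dtw_distance(a, b):
    """DTW cost between two sequences using the basic local
    constraint: each step is horizontal, vertical, or diagonal."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0  # first-interval (start-point) constraint
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # vertical step
                                 D[i][j - 1],      # horizontal step
                                 D[i - 1][j - 1])  # diagonal step
    return D[n][m]  # end-interval (end-point) constraint

# A time-stretched copy of a template aligns with zero cost:
print(dtw_distance([1, 2, 3, 3, 4], [1, 2, 3, 4]))  # 0.0
```

A global limitation would simply restrict the inner loop to a band of j values around the diagonal, shrinking the search space.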
Artificial Neural Network

[Figure: simple computational element of a neural network with inputs x0, ..., x(N-1), weights w0, ..., w(N-1), and output y]

y = Θ( sum_{i=0..N-1} wi xi )
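The computational element above is a weighted sum passed through a threshold function Θ. The threshold choice and the hand-picked AND-gate weights below are illustration values only:

```python
def neuron(x, w, theta=lambda s: 1 if s >= 0 else 0):
    """Single computational element: y = theta(sum_{i=0}^{N-1} w_i * x_i)."""
    return theta(sum(wi * xi for wi, xi in zip(w, x)))

# With x_0 fixed to 1 as a bias input, hand-picked weights realize AND:
outputs = [neuron([1, a, b], [-1.5, 1, 1]) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```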
Artificial Neural Network (Cont'd)

Neural network types:
- Perceptron
- Time Delay
- Time Delay Neural Network (TDNN) computational element
Artificial Neural Network (Cont'd)

[Figure: single-layer perceptron with inputs x0, ..., x(N-1) and outputs y0, ..., y(M-1)]
Artificial Neural Network (Cont'd)

[Figure: three-layer perceptron]
Neural Network Topologies
TDNN
Neural Network Structures for Speech Recognition
Neural Network Structures for Speech Recognition (Cont'd)
Hybrid Methods
- Hybrid neural network and matched filter for recognition

[Figure: speech → acoustic features → delays → pattern classifier → output units]
Neural Network Properties
- The system is simple, but training requires many iterations
- Does not impose a specific structure
- Despite its simplicity, the results are good
- The training set is large, so training should be offline
- Accuracy is relatively good
Pre-processing
- Different preprocessing techniques are employed as the front end of speech recognition systems.
- The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.
The MFCC Method
- MFCC is based on how the human ear perceives sounds.
- Compared to other features, MFCC performs better in noisy environments.
- MFCC was developed primarily for speech recognition applications, but it also performs adequately for speaker recognition.
- The hearing unit of the human ear is the Mel, obtained from the following relation:

mel(f) = 2595 log10(1 + f / 700)
Steps of the MFCC Method

Step 1: Map the signal from the time domain to the frequency domain using the short-time FFT.
- Z(n): speech signal
- W(n): window function (e.g., a Hamming window)
- WF = e^(-j2π/F)
- m = 0, ..., F-1
- F: length of the speech frame
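Step 1 can be sketched in pure Python. A naive DFT (built from WF = e^(-j2π/F), as on the slide) stands in for the FFT, and the frame length and test-tone frequency are arbitrary illustration values:

```python
import cmath
import math

def hamming(F):
    """Hamming window of length F."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (F - 1)) for n in range(F)]

def dft_frame(frame):
    """Naive DFT: X[m] = sum_n x[n] * W_F^(m*n), with W_F = e^(-j*2*pi/F)."""
    F = len(frame)
    W_F = cmath.exp(-2j * math.pi / F)
    return [sum(frame[n] * W_F ** (m * n) for n in range(F)) for m in range(F)]

def stft(signal, F, step):
    """Step 1: split into frames, apply the window, transform each frame."""
    w = hamming(F)
    return [dft_frame([z * wn for z, wn in zip(signal[i:i + F], w)])
            for i in range(0, len(signal) - F + 1, step)]

# A tone at 8 cycles per frame should peak in DFT bin 8:
F = 64
tone = [math.sin(2 * math.pi * 8 * n / F) for n in range(2 * F)]
spectrum = stft(tone, F, F // 2)
peak_bin = max(range(F // 2), key=lambda m: abs(spectrum[0][m]))
print(peak_bin)  # 8
```

A real front end would use an actual FFT for speed; the framing-and-windowing structure is the same.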
Steps of the MFCC Method (Cont'd)

Step 2: Compute the energy of each filter-bank channel.
- The number of mel-scale filter banks is M.
- Wk(j), k = 0, 1, ..., M-1, are the filter-bank transfer functions.
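Step 2 can be sketched with triangular filters spaced evenly on the mel scale. The sampling rate, bin count, and M below are arbitrary, and the standard mel relation from the earlier slide is assumed:

```python
import math

def hz_to_mel(f):
    """Standard mel relation: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, n_bins, sample_rate):
    """M triangular filters W_k with center frequencies equally
    spaced on the mel scale between 0 and sample_rate/2."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(low + (high - low) * k / (M + 1)) for k in range(M + 2)]
    # map the edge frequencies onto spectrum bin indices
    bins = [int(round(f * (n_bins - 1) / (sample_rate / 2.0))) for f in edges_hz]
    filters = []
    for k in range(1, M + 1):
        w = [0.0] * n_bins
        for j in range(bins[k - 1], bins[k]):        # rising slope
            w[j] = (j - bins[k - 1]) / (bins[k] - bins[k - 1])
        for j in range(bins[k], bins[k + 1]):        # falling slope
            w[j] = (bins[k + 1] - j) / (bins[k + 1] - bins[k])
        filters.append(w)
    return filters

def channel_energies(power_spectrum, filters):
    """Step 2: energy of each filter-bank channel."""
    return [sum(w * p for w, p in zip(f, power_spectrum)) for f in filters]

# Energies of a flat power spectrum through a 10-channel bank:
energies = channel_energies([1.0] * 129, mel_filterbank(10, 129, 8000))
```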
Distribution of the filters on the mel scale [figure]
Steps of the MFCC Method (Cont'd)

Step 4: Compress the spectrum and apply the DCT to obtain the MFCC coefficients.
- In the above relation, n = 0, ..., L is the order of the MFCC coefficients.
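Step 4 can be sketched as a DCT-II of the log channel energies. The coefficient order L and the input energies below are arbitrary illustration values:

```python
import math

def mfcc_from_log_energies(log_energies, L):
    """Step 4: DCT-II of the M log channel energies,
    returning coefficients c_0 .. c_L."""
    M = len(log_energies)
    return [sum(e * math.cos(math.pi * n * (k + 0.5) / M)
                for k, e in enumerate(log_energies))
            for n in range(L + 1)]

# Equal log-energies in every channel collapse into c_0 alone:
coeffs = mfcc_from_log_energies([1.0] * 8, L=4)
```

Keeping only the low-order coefficients (small L) is the spectral compression the slide refers to: the DCT concentrates the smooth spectral envelope into the first few terms.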
The Mel-Cepstrum Method

[Diagram: time-domain signal → framing → |FFT|² → Mel-scaling → Logarithm → IDCT → cepstra (low-order coefficients) → differentiator → delta & delta-delta cepstra]
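The differentiator stage of the diagram can be sketched with the common regression formula for delta features (a standard choice, assumed here rather than taken from the slides):

```python
def delta(frames, K=2):
    """Delta features: d_t = sum_{k=1..K} k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2),
    clamping indices at the sequence edges."""
    T, D = len(frames), len(frames[0])
    denom = 2 * sum(k * k for k in range(1, K + 1))
    at = lambda t: frames[min(max(t, 0), T - 1)]
    return [[sum(k * (at(t + k)[i] - at(t - k)[i]) for k in range(1, K + 1)) / denom
             for i in range(D)]
            for t in range(T)]

# For linearly growing cepstra, the interior deltas equal the slope:
ramp = [[float(t)] for t in range(6)]
d = delta(ramp)
print(d[2][0], d[3][0])  # 1.0 1.0
```

Applying `delta` twice gives the delta-delta (acceleration) cepstra of the diagram.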
Mel-Cepstral (MFCC) Coefficients [figure]
Properties of Mel-Cepstral (MFCC) Features
- Maps the mel filter-bank energies onto the directions of maximum variance (using the DCT)
- Makes the speech features partially independent of one another (an effect of the DCT)
- Good performance in clean environments
- Reduced performance in noisy environments
Time-Frequency Analysis
- Short-term Fourier Transform: the standard way of frequency analysis; it decomposes the incoming signal into its constituent frequency components.
  - W(n): windowing function
  - N: frame length
  - p: step size
Critical Band Integration
- Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
- Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.
Bark Scale [figure]
Feature Orthogonalization
- Spectral values in adjacent frequency channels are highly correlated.
- This correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
- Decorrelation is useful to improve parameter estimation.